[00:11:17] !log Doc to apply a change to Zuul / reload it is at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Changing_the_configuration (logging there so people find it) [00:11:25] Logged the message, Master [00:12:28] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [00:13:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [01:01:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 311 seconds [01:14:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.562 seconds [01:15:37] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:20:07] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL - No data received from host [01:49:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [02:17:25] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [02:20:25] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:25:31] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 184 seconds [02:25:40] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 189 seconds [02:27:17] !log LocalisationUpdate completed (1.21wmf6) at Fri Dec 21 02:27:16 UTC 2012 [02:27:27] Logged the message, Master [02:28:31] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [02:28:31] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 206 seconds [02:29:07] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 226 seconds [02:30:46] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 183 seconds [02:30:55] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 185 seconds [02:31:31] PROBLEM - MySQL Slave Delay on db1041 is CRITICAL: CRIT replication delay 195 seconds [02:31:49] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 201 seconds [02:32:07] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 205 seconds [02:40:13] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Fri Dec 21 02:39:57 UTC 2012 [02:41:34] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Fri Dec 21 02:41:20 UTC 2012 [02:59:43] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Fri Dec 21 02:59:32 UTC 2012 [03:02:34] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 14 seconds [03:03:01] RECOVERY - MySQL Slave Delay on db1041 is OK: OK replication delay 0 seconds [03:03:19] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [03:03:37] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [03:03:55] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [03:06:28] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [03:19:13] 
RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:19:13] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [03:30:01] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [03:30:28] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [06:37:11] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [06:37:11] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:37:12] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [06:37:12] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:37:12] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [06:37:12] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [07:54:39] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Fri Dec 21 07:54:13 UTC 2012 [07:56:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:56:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:04:42] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:04:42] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:04:43] RECOVERY - swift-object-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:05:28] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:05:28] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:05:28] RECOVERY - swift-container-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:05:36] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:05:45] RECOVERY - swift-object-auditor on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:05:45] RECOVERY - swift-account-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:05:54] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:06:03] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:06:27] hmm? [08:06:32] apergos: that you? 
[08:07:40] yes [08:08:29] you'll prolly see a few more message before I'm through, just ignore 'em [08:10:24] RECOVERY - swift-account-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:11:33] okay, good :) [08:12:58] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:14:54] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100% [08:16:42] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [08:20:54] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:22:33] how's swift? [08:23:01] fine, I have to deal with the changed disk layout [08:25:17] hm? [08:25:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [08:25:55] sdm/n3 etc [08:27:51] I'm about to push netboot.cfg [08:28:05] want me to switch ms-be3 to ms-be.cfg? [08:28:26] oh noes, it's an ssd one [08:28:32] okay, I'll leave it to you then [08:28:56] it's jut the ring switch I'm working on now [08:29:07] thanks for offering but it's all good [08:30:39] New patchset: Faidon; "partman: fix entries for eqiad ms-fes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39734 [08:31:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39734 [08:49:35] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [08:49:35] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [08:53:02] PROBLEM - Memcached on ms-fe1002 is CRITICAL: Connection refused [08:53:56] PROBLEM - SSH on ms-fe1001 is CRITICAL: Connection refused [08:54:05] PROBLEM - SSH on ms-fe1002 is CRITICAL: Connection refused [08:54:14] PROBLEM - Memcached on ms-fe1001 is CRITICAL: Connection refused [08:54:59] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [08:58:48] !log maxsem synchronized php-1.21wmf6/extensions/GeoData/ 'Update GeoData to master to fix job problems' [08:58:57] Logged the message, Master [08:59:49] !log maxsem synchronized php-1.21wmf5/extensions/GeoData/ 'Update GeoData to master to fix job problems' [08:59:57] Logged the message, Master [09:05:14] New patchset: MaxSem; "Switch GeoData to solr1001" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39737 [09:06:32] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:06:36] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39737 [09:07:44] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [09:08:11] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:08:20] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 195 seconds [09:10:34] !log Populating solr1001 with enwiki coordinates [09:10:44] Logged the message, Master [09:12:41] PROBLEM - NTP on ms-fe1001 is CRITICAL: NTP CRITICAL: No response from NTP server [09:15:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.556 seconds [09:20:49] apergos or paravoid, could you look something up for me? what is the error in /var/log/jetty/2012_12_21.stderrout.log on solr1001? 
[09:21:39] there's a lot of stuff in there MaxSem [09:22:04] search for exceptions [09:22:39] root@solr1001:~# grep -i exception /var/log/jetty/2012_12_21.stderrout.log [09:22:39] root@solr1001:~# [09:22:57] nothing with 'error' in it either [09:23:39] mmm, then there was a connectivity error [09:23:42] thanks! [09:23:44] sure [09:29:56] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:30:23] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:32:47] PROBLEM - NTP on ms-fe1002 is CRITICAL: NTP CRITICAL: Offset unknown [09:35:36] !log maxsem synchronized wmf-config/CommonSettings.php 'Switch GeoData to solr1001' [09:35:44] Logged the message, Master [09:37:41] afk for a little (errand) [09:50:32] RECOVERY - NTP on ms-fe1002 is OK: NTP OK: Offset -0.03483080864 secs [09:50:32] RECOVERY - NTP on ms-fe1001 is OK: NTP OK: Offset -0.03293740749 secs [09:52:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:55:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [09:57:29] New patchset: MaxSem; "Take yttrium out of rotation, everything has been moved to the new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39739 [10:12:58] New patchset: Hashar; "rename some files to use '-labs' instead of '-wmflabs'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39059 [10:12:58] New patchset: Hashar; "file included by CS.php now uses -labs instead of -wmflabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39060 [10:20:03] MaxSem: are you still deploying anything ? [10:20:08] MaxSem: going to push two changes for labs [10:20:17] hashar, go ahead [10:20:21] danke [10:22:48] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39059 [10:23:03] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39060 [10:23:52] !log Deploying MediaWiki config changes for labs, basically renaming files from -wmflabs to -labs. {{gerrit|39059}} and {{gerrit|39060}} [10:24:00] Logged the message, Master [10:28:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:30:44] syncing [10:31:59] !log hashar synchronized wmf-config [10:32:08] Logged the message, Master [10:32:37] !log hashar synchronized all-labs.dblist [10:32:46] Logged the message, Master [10:38:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.809 seconds [10:56:27] New patchset: Hashar; "beta: disable wmgAddWikiNotify" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39742 [11:01:59] New patchset: Hashar; "beta: rm global $wmfRealm before including IS-labs.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39744 [11:04:12] New patchset: Hashar; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [11:05:25] New review: Hashar; "rebased again." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32167 [11:13:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:23] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 197 seconds [11:23:32] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 202 seconds [11:24:08] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 198 seconds [11:24:17] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 201 seconds [11:29:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [11:31:20] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [11:31:29] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [11:32:05] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [11:32:23] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:35:42] New patchset: Dereckson; "(bug 42288) Babel configuration for sv.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39747 [11:38:03] New patchset: Dereckson; "(bug 42288) Babel configuration for sv.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39747 [12:01:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [12:18:53] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [12:21:53] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [12:29:59] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [12:49:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.728 seconds [13:07:34] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:35:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39742 [13:36:05] !log hashar synchronized wmf-config/CommonSettings-labs.php 'beta: {{gerrit|39742}}' [13:36:14] Logged the message, Master [13:38:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:44] !log Zuul pipeline gate-and-submit will now add a message in Gerrit when it start. Would let people know the jobs have started after they voted CR+2 {{gerrit|39584}} [13:42:52] Logged the message, Master [13:46:37] !log Created GeoData tables for wikis in special.dblist [13:46:46] Logged the message, Master [13:52:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [13:54:20] New patchset: MaxSem; "Enable GeoData on wikiedias and explicitly on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39757 [13:55:55] woo [13:57:28] paravoid: woo what ? :D [13:58:56] !log Load-testing geosearch [13:59:04] Logged the message, Master [14:01:14] new feature [14:12:35] wyhat feature? 
[14:16:54] !log Load test passed successfully [14:17:02] Logged the message, Master [14:17:55] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39757 [14:18:41] !log Enabling whitespace checking on mediawiki-core-lint [14:18:49] Logged the message, Master [14:20:24] hashar, making Jenkins troll for you?:P [14:20:32] yup :-] [14:20:55] annnnd that does not work :-] [14:23:01] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable GeoData on wikiedias and explicitly on test2' [14:23:09] Logged the message, Master [14:23:22] * MaxSem waits for explosion [14:25:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:05] MaxSem: https://integration.mediawiki.org/ci/job/mediawiki-core-lint/2204/console [14:27:10] MaxSem: with color output :-]]]]]] [14:27:49] hashar, congrat - and I couldn't even break wikipedia. dammit! [14:28:36] hashar, will it be non-binding? [14:29:18] MaxSem: what do you mean by non-binding ? [14:31:23] what if it has w/s problems? [14:32:06] ahh [14:32:07] then we will find out a solution :-] [14:41:20] New patchset: MaxSem; "Try GeoData jobs on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39763 [14:42:38] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39763 [14:42:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [14:43:52] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Try GeoData jobs on enwiki' [14:44:00] Logged the message, Master [14:53:41] New patchset: MaxSem; "Enable GeoData jobs everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39765 [14:54:07] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39765 [14:55:42] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable GeoData jobs everywhere' [14:55:50] Logged the message, Master [15:03:29] New review: Demon; "Setting to the worst possible score so gerrit stops trying to merge this and I can stop being spammed." [operations/mediawiki-config] (master); V: -1 C: -2; - https://gerrit.wikimedia.org/r/38063 [15:04:22] ^demon, just abandon and restore;) [15:04:34] <^demon> Yeah, that'll work too. [15:04:45] Change abandoned: Demon; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38063 [15:04:58] Change restored: Demon; "Ok, now. Let's not submit this until the parents are ready." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38063 [15:05:34] <^demon> MaxSem: 2.6 lets that interval to retry to be configured. I'm going to set it to like...never. [15:06:06] <^demon> I think it's a brain-dead feature and fails more often than it works. [15:06:07] it just needs not to spam [15:06:18] why not retry quietly:) [15:06:33] <^demon> Well, supposedly it does. [15:07:01] <^demon> The code makes it look like if $message == $previousMessage && $state == $previousState, skip posting the message. 
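A rough sketch of the duplicate-suppression check ^demon describes just above. Gerrit itself is Java, so this is only an illustration of the described logic with made-up function and variable names, not Gerrit's actual code:

```python
# Illustration only: the "skip posting if identical" check described above,
# not Gerrit's real (Java) implementation.

def should_post(message, state, previous_message, previous_state):
    """Post a new review comment only if something changed since the last attempt."""
    if message == previous_message and state == previous_state:
        return False  # same text, same state: stay quiet instead of re-spamming
    return True

# e.g. a quiet retry of the same failed merge would be suppressed:
assert should_post("Change could not be merged", "NEW",
                   "Change could not be merged", "NEW") is False
```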
[15:15:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:10] New patchset: Hashar; "move PHP linter to a new `wmfscripts` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [15:31:17] New patchset: Hashar; "Contint requires a 512M tmpfs file fs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35159 [15:31:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [15:37:18] apergos: if still on duty, would you mind merging in a basic conf file in ops/puppet please ? https://gerrit.wikimedia.org/r/#/c/39049/ [15:37:25] looking [15:37:27] apergos: that tweak the python linting configuration. [15:37:40] would make it lint our *.py.erb files :-] [15:37:57] New patchset: Ottomata; "Fixing eventlogging rsync so that there is an rsync module set up on vanadium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39770 [15:38:00] does not need any puppet to be run anywhere :) [15:38:02] hope none of those lint rules will break on the template files [15:38:26] little bits of ruby in there or whatever [15:38:26] apergos: they will broke but the jenkins job is not preventing patch submission :-) [15:38:46] apergos: yeah I was worrying about it, seems pep8 works fine with it [15:38:50] ok [15:38:52] we'll see [15:40:05] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39049 [15:40:10] \O/ [15:40:11] New patchset: Ottomata; "Fixing eventlogging rsync so that there is an rsync module set up on vanadium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39770 [15:40:44] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39770 [15:42:12] apergos: works like a charm :-] [15:42:21] nice [15:57:51] New review: Hashar; "Also note that wmf-config/CommonSettings.php has some changes related to central auth. They seem fin..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167 [16:04:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:17:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.707 seconds [16:28:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [16:37:56] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:37:56] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:37:57] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [16:37:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [16:37:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [16:37:57] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [16:51:20] https://bugzilla.wikimedia.org/show_bug.cgi?id=40910#c5 [16:51:25] Surely someone here knows something about this? 
[16:51:57] It's been in the repo two months now but has clearly not been deployed to blog.wm.o [16:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:19] New patchset: Demon; "Revert "Kill mobileRedirect.php, not used since forever"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39775 [16:56:45] <^demon> Krenair: I'm not sure how blog changes are pushed--guillom would know. [17:08:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds [17:18:02] New patchset: Ottomata; "Enabling 4 accounts on stat1. See RT 4106" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39776 [17:18:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39776 [17:21:26] afk for most of the evening I think [17:21:33] see folks off and on tomorrow (mostly off) [17:26:14] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:41] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:50] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:01] New patchset: Ottomata; "Fixing mediawiki .git url in class misc::statistics::mediawiki." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39779 [17:29:21] New patchset: Ottomata; "Fixing mediawiki .git url in class misc::statistics::mediawiki." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39779 [17:29:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39779 [17:40:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:01] New review: Dereckson; "I opened several .rb files in operations/puppet repo and can't find a comment standard." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/39342 [17:54:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.577 seconds [17:57:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:57:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:09:08] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:29:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:19] New patchset: Aaron Schulz; "Switched all wikis to swift for captchas." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39786 [18:38:32] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39786 [18:39:13] !log aaron synchronized wmf-config/CommonSettings.php 'Switched all wikis to swift for captchas' [18:39:21] Logged the message, Master [18:39:50] paravoid: ^ [18:39:57] yay! 
[18:40:31] kudos [18:45:21] !log re-enabled password protection on ganglia [18:45:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [18:45:29] Logged the message, Master [18:57:17] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 215 seconds [18:57:35] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 224 seconds [19:02:05] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [19:02:23] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [19:12:55] ^demon: is the hume key thing resolved? [19:13:28] s/key/fingerprint/ [19:14:40] <^demon> notpeter: I didn't have any problems logging in yesterday. [19:14:43] <^demon> I can verify again now. [19:14:47] sure [19:15:03] <^demon> Yep, still works fine :) [19:15:06] cool! [19:15:08] problem solved [19:15:21] <^demon> Also, is removing git a purposeful change, or an oversight? [19:15:29] wha? [19:15:46] no idea what you're talking about, so oversight :) [19:15:59] <^demon> Hmm, `git` seems to be back (it was definitely missing) [19:16:09] someone probably installed by hand, then [19:16:12] <^demon> But I'm still getting that __git_ps1 can't be found. [19:16:20] huhu [19:16:46] Damn you ganglia [19:17:52] <^demon> notpeter: I mean, it's not like I'll die without __git_ps1, but it does...well, break my $PS1 ;-) [19:18:19] I just still have no idea why you need an old-school play station [19:18:22] can't even run linux [19:18:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:54] <^demon> notpeter: har har har :p [19:21:03] notpeter: "there is (was) distro for the PS1 called Runix." [19:21:12] notpeter: Neither can the PS3 anymore ;) [19:21:24] New patchset: Ottomata; "Monitoring analytics webrequest udp2log instances." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39790 [19:21:55] <^demon> I've got a Wii that won't boot. I've thought about trying to get some linux running on it. [19:22:17] <^demon> Haven't tried though--could be a hardware problem rendering the whole project an exercise in futility. [19:22:20] I've thought about using the Wii controller as an input device on my Linux PC [19:22:40] <^demon> Code review using a Wii controller? Sign. Me. Up. [19:22:52] <^demon> Crappy patch? Slash all over the screen like in zelda. [19:23:00] hehe, nice [19:23:16] "LGTM. Approved" [19:23:18] NOOOOOOOOOOOOO [19:23:20] http://www.tomshardware.com/news/linux-3.1-kernel-linus-torvalds-wii-controller,13801.html [19:23:37] You can control Linux with a WII controller? [19:23:41] *Linus [19:23:58] "There is also a stub driver to enable Linux to work with Nintendo's Wii remote via the HID protocol over Bluetooth. If this feature is received well by the community, we could be seeing a version of gesture based computing or gaming for Linux devices. " [19:25:00] http://hacknmod.com/hack/top-30-wiimote-hacks-of-the-web/ [19:25:08] so much you can do with it:) [19:26:04] notpeter: Theremin http://blog.amnesiarazorfish.com.au/theremin-wii-remote-hack/ :) [19:27:53] <^demon> I don't have any wiimotes or games or anything anymore since I gave all that to my brother for christmas. All I've got is the non-booting wii. [19:28:05] <^demon> Now I kinda wish I'd hung onto at least one of the controllers. 
[19:31:52] New patchset: Raimond Spekking; "Cleanup dewiki import sources" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39793 [19:31:56] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:33:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.377 seconds [19:33:45] New patchset: Ottomata; "Monitoring analytics webrequest udp2log instances." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39790 [19:36:50] New review: Pyoungmeister; "talked to andrew a bunch and this seemed like the most reasonable way to monitor some udp2log instan..." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/39790 [19:36:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39790 [19:49:54] "Mediawiki-enterprise"-l MediaWiki for enterprises [19:51:12] New patchset: Pyoungmeister; "Revert "Revert "Revert "Revert "Revert "flip flopping db61 manifests for another test"""""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39794 [19:53:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39794 [20:07:27] New review: MarkTraceur; "Is there any word on this yet?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35285 [20:08:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.445 seconds [20:54:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [21:33:12] !log upgrading salt on all minions [21:33:20] Logged the message, Master [21:33:35] !log also upgrading: libzmq3, python-zmq on all minions [21:33:43] Logged the message, Master [21:33:51] seasoning all minions [21:34:48] :D [21:35:20] heh [21:35:38] mark: you guys are using xfs for ceph, right? [21:35:41] I <3 running one command and having every system update itself [21:35:45] AaronSchulz: yes [21:35:48] ok [21:35:51] will test btrfs as well at some point [21:36:07] you're a brave one with btrfs [21:36:42] one a small portion of OSDs, why not [21:36:56] on* [21:37:58] with btrfs it doesn't need a journal, which is nice [21:38:36] yeah [21:38:44] btrfs definitely has advantages [21:38:51] are you going to run a really new kernel? [21:39:00] if we test that, yeah [21:39:26] cool. looking forward to seeing the results [21:39:33] the ceph guys recommend precise+1's kernel [21:39:36] (i forget its name) [21:39:58] quantal? [21:40:00] right [21:41:10] AaronSchulz: just as a heads up, we're probably going to put originals and thumbs in separate ceph pools [21:41:23] which means that mediawiki may have to specify a pool name in a header when creating (swift) containers [21:41:40] I imagine that's a trivial change [21:42:06] hm. 
3 minions aren't sending salt-pings after upgrade [21:42:20] pepper their asses [21:43:11] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host [21:43:11] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - swift-account-server on ms-be1010 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - swift-container-updater on ms-be1010 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused [21:43:12] PROBLEM - swift-object-updater on ms-be1010 is CRITICAL: Connection refused by host [21:43:23] mark: is the header needed when stating/deleting containers or just on creation? [21:43:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:29] AaronSchulz: just creation [21:43:45] if it's not there ceph puts all containers in one default pool [21:43:59] and for thumbs we may want to have different crush rules, and e.g. less replicas [21:44:09] awesome [21:44:17] so on container creation, just something like X-Ceph-Pool: thumbnails [21:44:18] that should do [21:44:18] yeah, I could hack CF a little for that, doesn't sound hard [21:44:30] does php cloudfiles suck as much as python cloudfiles? [21:44:31] jesus. [21:44:46] it does suck, though we forked it a lot [21:44:51] right [21:44:59] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.747 second response time [21:45:06] upstream has a changed merged that was an incomplete wip [21:45:06] i imported and rewrote a lot of its methods in my script [21:45:08] PROBLEM - Memcached on marmontel is CRITICAL: Connection refused [21:45:08] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host [21:45:09] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host [21:45:09] PROBLEM - swift-object-auditor on ms-be1010 is CRITICAL: Connection refused by host [21:45:09] PROBLEM - swift-container-auditor on ms-be1010 is CRITICAL: Connection refused by host [21:45:09] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host [21:45:09] PROBLEM - swift-account-auditor on ms-be1010 is CRITICAL: Connection refused by host [21:45:17] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:22] and had to work around a lot of its limitations in horrid ways ;) [21:45:35] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host [21:45:35] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host [21:45:35] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host [21:45:35] PROBLEM - swift-object-replicator on ms-be1010 is CRITICAL: Connection refused by host [21:45:35] PROBLEM - swift-account-reaper on ms-be1010 is CRITICAL: Connection refused by host [21:45:35] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:45:36] PROBLEM - SSH on sq48 is CRITICAL: Connection refused [21:45:36] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:46] mark: pepper them I will! 
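A minimal sketch of what the container-creation tweak discussed above could look like from the client side, using the plain Swift HTTP API rather than the actual MediaWiki/CloudFiles code. The storage URL, token and container name are placeholders, and X-Ceph-Pool is just the header name mark floats above:

```python
# Sketch only: create a Swift container while passing a pool-selection header.
# Endpoint, token and container name below are hypothetical.
import requests

storage_url = "http://ms-fe.example.org/v1/AUTH_mw"  # placeholder Swift storage URL
token = "AUTH_tk_example"                            # placeholder auth token

resp = requests.put(
    storage_url + "/wikipedia-commons-local-thumb",  # container to create
    headers={
        "X-Auth-Token": token,
        "X-Ceph-Pool": "thumbnails",  # only needed at creation time (per the chat);
                                      # if omitted, ceph would fall back to its default pool
    },
)
resp.raise_for_status()  # Swift answers 2xx whether the container was created or already existed
```

As noted in the discussion, the header would matter only on container creation; stat and delete requests would not need it.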
[21:45:53] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: Connection refused [21:45:53] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host [21:45:53] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host [21:45:53] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host [21:45:53] PROBLEM - swift-account-replicator on ms-be1010 is CRITICAL: Connection refused by host [21:45:53] PROBLEM - swift-container-server on ms-be1010 is CRITICAL: Connection refused by host [21:45:54] PROBLEM - swift-object-server on ms-be1010 is CRITICAL: Connection refused by host [21:46:02] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-dialog-sri-lanka.log, /a/squid/zero-vodaphone-india.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [21:46:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:04] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:12] what's with all these swift nagios errors? [21:46:17] eqiad doesn't have swift [21:46:19] those are ceph boxes [21:46:21] ah [21:46:27] but I dunno, faidon was working on them [21:46:31] he just left [21:48:18] oh [21:48:19] btw [21:48:29] New patchset: Ottomata; "Fixing monitoring for udp2log packet loss on analytics machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39827 [21:48:32] I was checking out the downed salt minions the other day [21:48:39] and a few boxes are actually just broken :D [21:48:53] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [21:48:59] I need to put in rts for those [21:49:04] a couple were ms boxes, though [21:50:56] hehe [21:50:58] ok i'm going off [21:51:00] have a nice weekend [21:51:03] bye [21:53:21] later [21:53:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39827 [21:54:50] New patchset: Ottomata; "Including ldap on all analytics nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39828 [21:55:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39828 [21:59:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.664 seconds [22:03:08] PROBLEM - NTP on sq48 is CRITICAL: NTP CRITICAL: No response from NTP server [22:05:05] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: No response from NTP server [22:05:59] PROBLEM - NTP on ms-be1010 is CRITICAL: NTP CRITICAL: No response from NTP server [22:20:04] ori-l-away: I just added libzmq3 to all systems [22:20:20] and upgraded python-zmq to 2.2.0 that is compiled against libzmq3 [22:20:32] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [22:23:23] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:30:30] mutante: http://hire.jobvite.com/Jobvite/Job.aspx?j=o4cKWfwG&c=qSa9VfwQ [22:30:46] awjr: ah cool, thx [22:30:50] :D [22:31:29] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [22:33:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad 
Request - 336 bytes in 6.947 seconds [23:06:06] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [23:09:06] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:09:17] New patchset: Pyoungmeister; "removing no longer needed mod to mysql path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39833 [23:11:06] New patchset: Pyoungmeister; "removing no longer needed mod to mysql path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39833 [23:11:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39833 [23:22:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:48] New patchset: Pyoungmeister; "adding eqiad imagescalers to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39835 [23:34:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39835 [23:36:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.065 seconds [23:39:34] New patchset: Pyoungmeister; "recommissioning mw1001-mw1160" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39837 [23:40:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39837