[00:00:08] PROBLEM - Puppet freshness on cp1024 is CRITICAL: Puppet has not run in the last 10 hours
[00:06:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[00:06:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[00:43:11] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:11] PROBLEM - Puppet freshness on cp1027 is CRITICAL: Puppet has not run in the last 10 hours
[00:51:08] PROBLEM - Puppet freshness on cp1028 is CRITICAL: Puppet has not run in the last 10 hours
[01:26:58] New patchset: Tim Starling; "More rights for mediawikiwiki sysops" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22536
[01:27:26] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22536
[01:35:00] morning TimStarling
[01:35:10] hello
[01:40:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds
[01:41:05] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 268 seconds
[01:47:14] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 638s
[01:56:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds
[01:59:41] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 7s
[02:00:53] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 12 seconds
[02:50:14] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Tue Sep 4 02:50:05 UTC 2012
[02:51:28] !log powering down ms-be6, broken hardware
[02:51:36] Logged the message, Master
[02:54:26] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: Connection refused by host
[02:54:35] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host
[02:54:44] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: Connection refused by host
[02:54:44] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: Connection refused by host
[02:54:53] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: Connection refused by host
[02:54:53] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused
[02:55:02] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: Connection refused by host
[02:55:02] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host
[02:55:20] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: Connection refused by host
[02:55:29] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: Connection refused by host
[02:55:29] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: Connection refused by host
[02:55:47] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host
[02:55:56] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host
[03:17:50] RECOVERY - Puppet freshness on snapshot1001 is OK: puppet ran at Tue Sep 4 03:17:18 UTC 2012
[04:04:38] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours
[04:09:44] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours
[04:33:08] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms
[04:35:41] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:41] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:41] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[04:35:41] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:42] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:43] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:38:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:05:23] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[05:40:11] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[08:36:42] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[08:36:42] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[08:36:42] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[08:42:42] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[09:54:24] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours
[10:01:18] PROBLEM - Puppet freshness on cp1024 is CRITICAL: Puppet has not run in the last 10 hours
[10:07:18] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[10:07:18] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[10:43:51] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours
[10:43:51] PROBLEM - Puppet freshness on cp1027 is CRITICAL: Puppet has not run in the last 10 hours
[10:51:57] PROBLEM - Puppet freshness on cp1028 is CRITICAL: Puppet has not run in the last 10 hours
[11:58:41] paravoid: ping
[12:38:10] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours
[12:51:13] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:06] !log Pushed varnish 3.0.3plus-rc1 packages into the precise-wikimedia APT repository
[12:55:17] Logged the message, Master
[13:05:07] New patchset: Mark Bergsma; "Import debian/ dir from testing/persistent" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/22545
[13:05:07] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm1) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/22546
[13:07:17] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/22545
[13:07:44] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/22546
[13:20:37] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds
[13:20:55] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[13:20:55] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[13:21:40] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[13:29:58] New patchset: Hashar; "(bug 38299) alias 'cmr10' font to 'Computer Modern'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22533
[13:31:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22533
[13:34:58] New review: Hashar; "Patchset 2 enforce the "Roman" style for cmr10 alias:" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/22533
[13:35:10] font madness
[13:42:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:43:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.064 seconds
[13:47:14] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (66170)
[13:47:59] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (66291)
[13:52:20] refreshLinks2: 65399
[13:52:24] oh wonderbar
[13:53:06] hashar: it's alright, they'll OOM ;)
[14:01:46] I am not sure why there are soooo many of them
[14:04:04] Reedy: seems like the zhwiki has implemented wikidata using templates :-]
[14:04:15] ...
[14:04:15] Country_data_Japan Country_data_Spain ...
[14:04:19] I shouldn't be suprised...
[14:05:00] there are tons of duplicates too
[14:05:13] isn't job_namespace,job_title a primary key?
[14:05:23] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours
[14:05:24] na its not
[14:05:25] hmm
[14:05:38] New patchset: Mark Bergsma; "Temporarily disable cp1029-1036" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22553
[14:06:30] New patchset: Mark Bergsma; "Send originals and thumbs/temps to Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22554
[14:07:16] maybe they are just pilling up
[14:07:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22553
[14:07:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22554
[14:10:20] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours
[14:11:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22553
[14:12:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22554
[14:14:49] New patchset: Demon; "Only set up SMTP for gerrit production host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217
[14:15:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22217
[14:17:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:25:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[14:32:07] New review: Demon; "Not sure what this error means:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/22215
[14:36:17] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:17] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:17] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:17] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[14:36:18] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:18] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:19] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:19] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:43:04] New patchset: Demon; "Puppetized gerrit replication config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215
[14:43:55] New review: Demon; "PS4 is based on testing I did on the labs install--worked as expected." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/22215
[14:43:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215
[14:54:58] New patchset: Mark Bergsma; "Don't do changes on cp1029-1036 for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22565
[14:55:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22565
[14:56:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22565
[14:59:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.309 seconds
[15:16:51] RECOVERY - Puppet freshness on cp1028 is OK: puppet ran at Tue Sep 4 15:16:26 UTC 2012
[15:19:05] RECOVERY - Varnish HTTP upload-backend on cp1028 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.054 seconds
[15:20:17] RECOVERY - Varnish HTCP daemon on cp1028 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[15:20:44] RECOVERY - Varnish HTTP upload-frontend on cp1028 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.061 seconds
[15:21:11] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:27:20] RECOVERY - Puppet freshness on cp1027 is OK: puppet ran at Tue Sep 4 15:27:07 UTC 2012
[15:30:11] RECOVERY - Varnish HTCP daemon on cp1027 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[15:30:29] RECOVERY - Varnish HTTP upload-backend on cp1027 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.054 seconds
[15:30:56] RECOVERY - Varnish HTTP upload-frontend on cp1027 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[15:31:50] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:36:29] RECOVERY - NTP on cp1028 is OK: NTP OK: Offset -0.05854594707 secs
[15:40:59] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[15:42:20] RECOVERY - Puppet freshness on cp1023 is OK: puppet ran at Tue Sep 4 15:42:06 UTC 2012
[15:47:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:47:53] RECOVERY - NTP on cp1027 is OK: NTP OK: Offset -0.05533313751 secs
[16:00:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds
[16:01:14] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[16:01:14] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.056 seconds
[16:02:53] RECOVERY - Puppet freshness on cp1024 is OK: puppet ran at Tue Sep 4 16:02:21 UTC 2012
[16:03:29] RECOVERY - NTP on cp1023 is OK: NTP OK: Offset -0.04468381405 secs
[16:04:32] RECOVERY - Varnish HTCP daemon on cp1024 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[16:06:02] RECOVERY - Varnish HTTP upload-backend on cp1024 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.054 seconds
[16:10:23] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Tue Sep 4 16:10:15 UTC 2012
[16:12:38] RECOVERY - Varnish HTTP upload-backend on cp1026 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.982 seconds
[16:14:08] RECOVERY - Varnish HTCP daemon on cp1026 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[16:21:56] RECOVERY - NTP on cp1024 is OK: NTP OK: Offset -0.04961526394 secs
[16:26:08] !log reimaging searchidx1001
[16:26:17] Logged the message, notpeter
[16:26:33] !log temp stopping puppet on brewster
[16:26:42] Logged the message, notpeter
[16:29:53] RECOVERY - NTP on cp1026 is OK: NTP OK: Offset -0.05717396736 secs
[16:30:11] PROBLEM - Host searchidx1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:53] RECOVERY - Host searchidx1001 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[16:39:47] PROBLEM - Lucene disk space on searchidx1001 is CRITICAL: Connection refused by host
[16:40:23] PROBLEM - SSH on searchidx1001 is CRITICAL: Connection refused
[16:40:38] Change abandoned: Hashar; "we use role::upload::cache on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16115
[16:41:34] Change abandoned: Demon; "Not going to get back to this soon -- abandoning for now. Can always restore if I change my mind abo..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21725
[16:46:10] New review: Hashar; "We need to rebuild the l10n cache. There is code for it hidden in scap, that need to be extracted ou..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/22116
[16:46:32] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:46:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds
[16:58:14] PROBLEM - NTP on searchidx1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:15:20] RECOVERY - Lucene disk space on searchidx1001 is OK: DISK OK
[17:21:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:22:44] !log search32 shutting down to replace processor
[17:22:53] Logged the message, Master
[17:29:26] New patchset: RobH; "adding helium and potassium as poolcounter hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22585
[17:30:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22585
[17:32:26] RECOVERY - NTP on searchidx1001 is OK: NTP OK: Offset -0.04581332207 secs
[17:35:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[17:39:36] New review: RobH; "self review is the best review!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22585
[17:39:36] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22585
[17:41:04] !log pulled latest update of branch 1.20wmf10 on labsconsole
[17:41:13] Logged the message, Master
[17:41:17] Ryan_Lane: you working from home?
[17:41:21] yep
[17:41:25] cheater ;p
[17:41:34] I'll be coming in the rest of the week
[17:42:56] heh
[17:43:27] RECOVERY - Host search32 is UP: PING WARNING - Packet loss = 80%, RTA = 0.23 ms
[17:45:15] New patchset: RobH; "tweaking the partman file for helium install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22587
[17:46:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22587
[17:46:27] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:57] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:49:45] New patchset: RobH; "tweaking the partman file for helium install, and netboot listing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22587
[17:50:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22587
[17:50:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22587
[17:53:02] notpeter: search32….processor is in….ran a puppet update…we'll see
[17:53:47] cmjohnson1: sweet! thank you
[17:54:07] !log helium being reinstalled to poolcounter server per RT#3407
[17:54:16] Logged the message, RobH
[18:07:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:08:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22495
[18:09:09] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16866
[18:20:05] anyone with RT access about? Can you tell me if there's a request for the mailing list rename of chaptercommittee-l to affcom?
[18:20:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.417 seconds
[18:21:24] mark: pong, a little late :)
[18:22:09] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[18:22:54] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[18:37:27] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[18:37:27] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[18:37:27] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[18:38:01] !log srv281 going down for troubleshooting
[18:38:10] Logged the message, Master
[18:43:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[18:54:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:15] New review: Umherirrender; "Nobody here, to get a +2 for this patch set? Is there a problem with this patch set, that there is n..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/21326
[19:00:35] New review: Dzahn; "I manually copied the config from patch set 3 to srv193 and tried to reload Apache." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13293
[19:08:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[19:09:46] !log swift consistency check (heads of various objects) running from bastion1001 as ariel in screen session, about 10 HEADS/sec, if this starts to impact swift just shoot it
[19:09:55] Logged the message, Master
[19:11:39] New review: Reedy; "I'm still catching up from a week without internet access" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/21326
[19:35:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21061
[19:36:24] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (25596), zhwiki (29899)
[19:37:45] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2713*
[19:41:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:24] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2588*
[19:43:54] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2363
[19:51:15] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (16751), zhwiki (32287)
[19:53:39] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[19:55:27] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[19:55:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22484
[19:56:00] ottomata: ^
[19:56:45] danke!
[19:56:58] np
[19:57:04] something still isn't quite right with that, but I need rfaulkner to help me fix
[19:57:06] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused
[19:57:06] not sure what is up
[19:57:10] but that is all def needed
[19:57:11] so thanks
[19:57:23] alright, sure
[20:00:29] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:01:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22619
[20:04:21] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:05:07] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22619
[20:05:37] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:06:06] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time
[20:06:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22619
[20:08:30] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[20:08:30] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[20:21:43] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:22:35] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22619
[20:22:49] Hi. Have you touched the SUL databases? - My global unified login isn't working anymore on some wikis and I've not touched anything.
[20:23:05] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:23:16] What error are you getting?
[20:23:20] "you" is vague
[20:23:44] you = Dear Wikimedia Tech Team :)
[20:23:56] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22619
[20:24:11] I can't log in anymore on, for example, bnwiki in spite of having a SUL there
[20:24:12] New review: Asher; "Why is WLM reliant on toolserver at all? If the api portion is being moved to wmf, why not the data..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/17964
[20:24:41] Incorrect password message, that's the error I get.
[20:25:23] New patchset: Demon; "Make SSL unconditional for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619
[20:26:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22619
[20:27:18] When do you know you could last logon?
[20:27:35] Yesterday.
[20:27:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:30:51] brb, phone call
[20:31:18] PROBLEM - Lucene on search1021 is CRITICAL: Connection refused
[20:34:27] RECOVERY - Lucene on search1021 is OK: TCP OK - 0.027 second response time on port 8123
[20:37:54] !log srv291 shutting down for troubleshooting
[20:38:03] Logged the message, Master
[20:38:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.332 seconds
[20:42:17] New patchset: Pyoungmeister; "lucene.php: moving en search back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22662
[20:44:16] heya paravoid, you around? if so, I have a probably easy to answer .deb packaging q
[20:45:20] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22662
[20:45:34] !log moving en search back to eqiad
[20:45:44] Logged the message, notpeter
[20:50:12] ottomata: yes
[20:50:43] heya
[20:50:58] so I think i've got things pretty good for my libanon stuff, need to build .deb
[20:51:04] i want to build packages for both lucid and precise
[20:51:29] do I need to modify changelog and then build each? or just build each and then rename .deb files to include ~lucid or ~precise?
[20:51:37] paravoid: thanks for you reply to the syslog mail :)
[20:52:03] never ever rename .debs
[20:52:48] you should add stanzas in debian/changelog like 1.0-1~lucid1 with a message that says e.g. "Backport to lucid" and rebuild
"Backport to lucid" and rebuild [20:53:02] this will create proper source/binary packages [20:53:35] hm, ok cool [20:53:51] is it proper to commit that to the repository with my debian/ stuff? [20:54:18] what repository? [20:54:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22619 [20:54:54] ah paravoid, some good news about the swift backend, I'm running a consistency check (= head on all object sin a few containers), and did half of one container with everything looking very good, that's about 30k objects. [20:54:58] the git way of doing that is having a master branch (upstream code only), a debian branch (up to date debian packaging, frequent merges with master) and a lucid and a precise branch [20:55:11] ahh cool [20:55:18] hmmmmm [20:55:34] that does sound nicer [20:55:43] this is the repo [20:55:44] https://gerrit.wikimedia.org/r/gitweb?p=analytics/libanon.git;a=tree [20:55:44] do you use git-buildpackage? [20:55:49] no, just dpkg-buildpackage [20:56:07] well... :) git-bp is nice [20:56:13] handles the master/debian branches nicely [20:56:20] reading... [20:56:21] cool [20:56:34] btw, feel free to add me as a reviewer on such commits if you like [20:57:18] cool, will do! i would love if you could do a post review on the debian/ dir that is already there [20:58:08] looks good; have you seen dh? [20:58:14] you could use it for even simpler debian/rules [20:58:46] i think so…but not really, i've seen it do the default stuff, but, since this came with an autogen.sh script [20:58:55] i wanted to just call that in the configure step [20:59:02] can I still use dh to do that? [20:59:50] you need dh-autoreconf, which is a separate package [20:59:55] Reedy: You might want to have a look at https://gerrit.wikimedia.org/r/22534 those were your files (according to git blame) [20:59:56] New patchset: Demon; "Puppetized gerrit replication config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [21:00:01] but, yes, it handles this case [21:00:38] hm. [21:00:48] hoo: they were only "mine" so much as in I added them to the git repo [21:00:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215 [21:01:15] Suspected that as they were both May this year :/ [21:01:25] Do you think they're safe to delete, though? [21:01:52] !log moving en search traffic back to pmtpa [21:02:01] New patchset: Demon; "Only set up SMTP for gerrit production host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217 [21:02:02] Logged the message, notpeter [21:02:49] New patchset: Pyoungmeister; "lucene.php: moving search traffic back to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22666 [21:02:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22217 [21:03:06] paravoid: still no wait-io on virt6-8 [21:03:19] Ryan_Lane: hmmmm, interesting. 
[21:00:38] hm.
[21:00:48] hoo: they were only "mine" so much as in I added them to the git repo
[21:00:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215
[21:01:15] Suspected that as they were both May this year :/
[21:01:25] Do you think they're safe to delete, though?
[21:01:52] !log moving en search traffic back to pmtpa
[21:02:01] New patchset: Demon; "Only set up SMTP for gerrit production host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217
[21:02:02] Logged the message, notpeter
[21:02:49] New patchset: Pyoungmeister; "lucene.php: moving search traffic back to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22666
[21:02:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22217
[21:03:06] paravoid: still no wait-io on virt6-8
[21:03:19] Ryan_Lane: hmmmm, interesting.
[21:03:23] yep
[21:03:31] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22666
[21:03:34] I'm not going to complain, but I'd like to know why :)
[21:04:00] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[21:05:10] notpeter: can throw srv266 back into the pool later today as well
[21:05:20] cmjohnson1: yep
[21:05:57] thx
[21:08:21] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused
[21:11:21] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.348 second response time
[21:12:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:22:29] New patchset: Pyoungmeister; "Revert "Using log4j to log Lucene results to udp2log."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22667
[21:23:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22667
[21:23:54] New review: Pyoungmeister; "for now..." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/22667
[21:23:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22667
[21:24:27] I caught someone's changes
[21:24:30] pertaining to gerrit
[21:25:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.810 seconds
[21:26:01] TomDaley's
[21:26:07] notpeter: maybe related to ^demon's mail just now?
[21:26:23] ahh, yes, TomDaley == ^demon
[21:26:44] where just now is 15 mins back
[21:27:26] I don't see it... well, it's live now!
[21:27:30] hope that's a positive thing...
[21:28:05] * Damianz waits for prod gerrit to break :D
[21:28:27] notpeter: actually it was labs-l
[21:28:54] ah, gotcha
[21:29:40] New patchset: Ottomata; "SearchDaemon.java - moving null checking into logResults() and out of encode() method." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22668
[21:29:41] New patchset: preilly; "Orange Cameroon added new IP ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22669
[21:30:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22669
[21:31:47] notpeter: can you merge this change set https://gerrit.wikimedia.org/r/#/c/22669/
[21:31:53] preilly: usre
[21:32:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22669
[21:32:21] preilly: need force puppet run/
[21:32:23] TomDaley: anything we should have in here that we currently dont? https://labsconsole.wikimedia.org/wiki/MediaWiki:Titleblacklist
[21:32:31] TomDaley: for instance, characters we shouldn't allow?
[21:32:34] notpeter: yes
[21:32:36] preilly: kk
[21:33:25] doing so now
[21:33:36] Ryan_Lane: gerrit2? :)
[21:33:43] Ryan_Lane: cron(d) ?
[21:34:00] is cron a user?
[21:34:21] TomDaley: gerrit2 is a user
[21:34:33] so is novaadmin
[21:34:39] I thought that was what we were blocking, not letting people register the name.
[21:34:58] it's already registered :)
[21:35:15] "User account "Gerrit2" is not registered."
[21:35:20] you already can't register names that are already registered
[21:35:25] huh
[21:35:34] https://labsconsole.wikimedia.org/wiki/User:Gerrit2 says no way
[21:35:46] Ryan_Lane: perfect timing for 2-factor
[21:35:58] tech days?
[21:35:59] yeah
[21:36:01] no
[21:36:03] gerrit-dev ldap login
[21:36:10] ah
[21:36:11] heh
[21:36:12] yeah
[21:36:22] root@gerrit-dev can basically escalate to anyone else
[21:36:27] yep
[21:36:36] to whoever's using it anyway (I didn't in fear of that)
[21:36:43] ugh, why are we using that for auth?
[21:36:44] not a huge fan of allowing apps to ldap auth in labs
[21:36:50] me neither
[21:36:56] we should really have another ldap server locally
[21:36:58] to test against
[21:37:05] Yes, please.
[21:37:23] and anyone can set their own ldap passwd in that other instance as much as they want (from the shell)
[21:37:37] can we also make ishmael/graphite use that new ldap too?
[21:37:45] eh?
[21:37:46] new ldap for what?
[21:37:49] graphite is in producton
[21:38:01] paravoid: an ldap on localhost on that one box
[21:38:14] so what. it's not a big deal if someone can break into graphite
[21:38:17] i think
[21:38:24] 04 21:36:56 < Ryan_Lane> we should really have another ldap server locally
[21:38:27] graphite already ldap auths
[21:38:31] locally = new ldap
[21:38:35] no no no
[21:38:40] ldap::self? :)
[21:39:00] I'm saying that apps that want to test ldap should each have their own ldap
[21:39:01] TomDaley: what's up with your blog? 404 ?
[21:39:16] I killed it a long time ago.
[21:39:21] Like...a year and a half ago?
[21:39:42] k. was just looking at the planet changes
[21:39:55] I put a comment on one of the changes saying I could be removed.
[21:41:45] Ryan_Lane: "adm", "shutdown" and "halt" might be users per http://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/usernames.html
[21:42:02] mutante: do you have sysop on labsconsole?
[21:42:13] would be good if you could add names as you notice them
[21:42:13] yep
[21:42:30] TomDaley: we need to exclude certain characters right?
[21:42:38] Like?
[21:42:38] TomDaley: any clue which ones are disallowed by gerrit?
[21:42:50] Offhand, don't know of any.
[21:42:53] I know we're utf-8 safe.
[21:43:09] ok. we'll find out quickly enough when I enable self-registration monday
[21:43:18] did I mention we're coming out of beta on monday? :)
[21:43:18] lemme ask
[21:43:20] paravoid: ^^
[21:43:53] yep, I saw that
[21:44:13] I would prefer still calling it "beta" but still :)
[21:44:22] heh
[21:44:33] well, it's hard to call something beta when you allow anyone to sign up
[21:44:43] Gmail did for a long time.
[21:44:46] open beta, I guess?
[21:44:55] what TomDaley said :)
[21:44:56] I'd consider it stable enough right now to come out of beta
[21:45:03] it's very web 2.0 to call everything beta :>
[21:45:03] it's been in beta for like a year
[21:45:08] gamma?
[21:45:13] well, a year in october, anyway
[21:45:18] When it's beta, you can claim "it's beta" when it breaks :)
[21:45:21] Ryan_Lane: putting me out of a job?
[21:45:30] jeremyb: yep :)
[21:45:42] Someone can write a gadget to make it say beta again
[21:45:43] TomDaley: I can just say "it's labs"
[21:46:00] Reedy: Do we have gadgets on labsconsole?
[21:46:05] yes
[21:46:40] Ryan_Lane: Just think how much money minecraft made in 'alpha' ;)
[21:46:49] hwh
[21:46:51] err
[21:46:52] heh
[21:49:34] Titleblacklist sorted alphabetically and added a few
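As background for the Titleblacklist exchange above: that labsconsole page uses the TitleBlacklist extension's format of one regular expression per line, with flags in angle brackets and # starting comments. Entries covering the reserved names discussed here might look like the following; these lines are illustrative, not a copy of the real page.

```
# Service accounts mentioned above (gerrit2, novaadmin)
(Gerrit2|Novaadmin) <newaccountonly>
# LSB-reserved system user names (adm, shutdown, halt)
(Adm|Shutdown|Halt) <newaccountonly>
```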
[21:52:09] New patchset: Jeremyb; "add new GLAM-WIKI US blog to en planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22671
[21:52:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22671
[21:53:05] thanks
[21:55:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22671
[21:55:09] * jeremyb pokes Ryan_Lane with an ircecho
[21:55:15] ah
[21:55:16] wow, fast mutante ;)
[21:55:20] lemme do some reviews today
[21:55:57] jeremyb: i just happened to be in my inbox
[21:56:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19786
[21:56:06] mutante: heh
[21:56:11] preilly: i also merged the ACL changes to Orange Niger
[21:56:17] on sockpuppet that is
[21:57:15] mutante: so it gets deployed on puppet run? or is there some cron job?
[21:57:37] * jeremyb can't remember how it worked with svn
[21:57:48] New patchset: Hashar; "extract l10n update from scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22673
[21:58:12] jeremyb: the URL will be added to config on puppet run, but then there is a cron at 0:00 UTC to actually run planet and use that config
[21:58:37] and.. that is just in the new planet ..on labs or zirconium so far..
[21:58:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22673
[21:58:45] need to run to bank.. bbl
[21:58:54] mutante: ahhh. but what about the list of feeds on the side of the page? that's also only once a day?
[21:59:14] and what about the old planet then?
[21:59:38] jeremyb: conflict :(
[21:59:53] Heh, totally knew that would conflict.
[21:59:54] mind rebasing?
[21:59:58] yeah, me too
[22:00:01] That hasn't been rebased since the gerrit.pp overhaul.
[22:00:19] Ryan_Lane: surely. just review for rubber stampability and then i'll rebase
[22:00:24] did
[22:00:28] k
[22:00:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:58] Ryan_Lane: which # exactly?
[22:01:18] https://gerrit.wikimedia.org/r/#/c/8344/
[22:01:29] ok
[22:01:32] jeremyb: eh yeah, old planet is still svn but svn should not be used if it can wait just a little while longer. we already synced the configs
[22:01:33] Ryan_Lane: can you do 8120 too?
[22:01:35] jeremyb: Actually, we can test out https://gerrit.wikimedia.org/r/#/c/11589/ on labs now too if we want.
[22:01:40] looking at it no
[22:01:41] now
[22:02:03] jeremyb: currently the thing is getting translations for the menu for the non-English languages.. need to run right now though
[22:02:09] TomDaley: oh that. i saw you comment on google code today ;)
[22:02:19] Yeah, somebody brought it up.
[22:02:19] Again
[22:02:25] mutante: bank!
[22:02:45] I'll be honest
[22:02:52] I have no clue what 8120 is doing
[22:03:03] New review: Hashar; "Will need https://gerrit.wikimedia.org/r/22673 to update the l10n cache." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/22116
[22:03:06] didn't you say that before ;/
[22:03:30] didn't it already support multiple files per project?
[22:03:31] iirc TomDaley reviewed that one
[22:03:57] I never actually *tested* it.
[22:04:09] every time the hooks get changed they break and I spend 30 mins to an hour fixing them
[22:04:13] every single time
[22:04:17] right, just reviewed
[22:04:25] do we have hooks working in labs?
[22:04:32] Well, they fire.
[22:04:37] But we don't have ircecho enabled on labs
[22:04:51] TomDaley: [commentlinks] so, i did already test it myself in labs and gave up on it because i couldn't get it to work the way i wanted. maybe i need to try harder or maybe we need to wait for upstream to fix it.
[22:04:55] sure but we could
[22:05:07] Ryan_Lane: i could test in labs if you like
[22:05:16] iswildcard = lambda string: "*" in string
[22:05:18] ^^??
[22:05:21] wtf?
[22:05:30] where is string being defined?
[22:05:54] wtf is match doing?
[22:06:53] all this so that we can support wildcards in the names?
[22:07:00] why don't we just explicitly name the files?
[22:07:21] i think it was wildcards for project names
[22:07:27] ahhhh
[22:07:27] don't remember for sure
[22:07:27] ok
[22:07:33] that makes sense, then
[22:07:36] !log adding srv266 and srv281 back to apaches pool (to see if they break...)
[22:07:45] Logged the message, notpeter
[22:08:17] Ryan_Lane: i don't remember the code that well (and haven't opened it recently) but in the excerpt you pasted 'string' is the only parameter for a function called iswildcard. iswildcard('foo') would be false and iswildcard('foo*') would be true
[22:08:47] https://gerrit.wikimedia.org/r/#/c/8120/6/files/gerrit/hooks/hookhelper.py,unified
[22:09:17] ah. crap
[22:09:19] ignore me
[22:09:23] I'm reading that porly
[22:09:25] poorly
[22:09:44] this is why I hate lambdas. this code is really not easy to read
[22:10:00] it's all lambdas and regexes
[22:10:03] You know...come 2.5 all this crap should be plugins anyway.
[22:10:09] Move IRC to a plugin.
[22:10:20] that doesn't make it any easier to manage
[22:11:50] and things like this: logs = isinstance(logs, basestring) and [logs] or logs
[22:11:52] anyway, i certainly would be happy to have it tested and i'm willing to do it. just don't know if i can do it today
[22:12:27] what does that line even do?
[22:12:47] 133, 134 and 135 make no sense to me
[22:13:04] doesn't 134 duck type to a boolean?
[22:13:28] which means it isn't iterable
[22:14:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[22:18:49] when I tell my friends I know some basic regex they stars at me
[22:18:56] "you are such a hacker" ...
[22:19:17] I should run a regex 101 one day
[22:20:03] is: "logs = isinstance(logs, basestring) and [logs] or logs" like "if blah ? something : somethingelse"
[22:20:04] ?
[22:20:05] hashar: everyone knows that 99% of hackers use grep
[22:20:11] jeremyb: ?
[22:20:47] Ryan_Lane: sorry, IRL distraction. back in a bit
[22:20:49] PROBLEM - poolcounter on helium is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd
[22:20:49] I really don't mind the extra lines of code that would make that easier to read
[22:23:29] Ryan_Lane: That's confusing also because there are no parentheses indicating the priority of and relative to or
[22:23:37] I mean you can look that stuff up, but it's often unclear when you read it
[22:23:42] exactly
[22:23:52] I'm asking for it to be changed in the review
[22:24:19] Generally, in languages like these, "foo or bar" means foo ? foo : bar and "foo and bar" means foo ? bar : false
[22:24:31] not a fan
[22:24:44] that one liner can be written more simply:
[22:24:45] if isinstance(logs, basestring):
[22:24:45] logs = [logs]
[22:24:46] JS has it and I really like being able to do index = index || 0;
[22:25:14] But doing it with and is rare, and mixing and and or like that without parentheses is a bit too weird for me
[22:25:32] my two liner does the same thing and is way clearer
[22:25:37] Yeah
[22:26:05] the original syntax can be mistaken to think you're duck typing to a boolean
[22:26:08] It's barely more code, too
[22:26:28] Actually, discounting whitespace it's actually less code
[22:26:34] yeah
[22:27:32] New review: Ryan Lane; "I'd like some of the syntax to be simplified. The logic looks fine, though." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8120
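To make the hookhelper.py complaints concrete, here is the idiom under review next to the two-line rewrite Ryan_Lane asks for, as a standalone Python 2 sketch (basestring is Python 2 only; the normalize_logs function names are invented for illustration):

```python
# The one-liner under review: "a and b or c" yields b when a is truthy,
# else c. It works here only because [logs] can never be falsy, and at
# first glance it reads like a duck-typed boolean expression.
def normalize_logs_old(logs):
    return isinstance(logs, basestring) and [logs] or logs

# The suggested rewrite: same behavior, obvious intent.
def normalize_logs_new(logs):
    if isinstance(logs, basestring):
        logs = [logs]
    return logs

# The lambda that prompted "where is string being defined?": 'string' is
# simply the lambda's parameter name, not the stdlib module.
iswildcard = lambda string: "*" in string

assert normalize_logs_old("x") == normalize_logs_new("x") == ["x"]
assert normalize_logs_old(["x", "y"]) == normalize_logs_new(["x", "y"])
assert iswildcard("foo*") and not iswildcard("foo")
```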
[22:28:01] RoanKattouw: still need this? https://gerrit.wikimedia.org/r/#/c/10669/
[22:28:19] I don't actually know
[22:28:19] oh
[22:28:20] ori-l-away: ?
[22:28:26] you have a review already
[22:28:31] Right
[22:28:42] I also submitted it for someone else (Ori) a long time ago and I don't know if he still needs it
[22:28:45] https://gerrit.wikimedia.org/r/#/c/11979/ ….?
[22:28:52] is this still waiting on someone's approval?
[22:29:11] yeah
[22:29:13] waiting ct's
[22:31:12] Yeah
[22:31:14] Ryan_Lane: does CT not have a gerrit account?
[22:31:25] we should add him as reviewer to shell requests like that, but i tried and no dice.
[22:31:34] yeah
[22:33:01] * AaronSchulz reminds tim about ScanSet :)
[22:33:59] what about it?
[22:34:20] Ryan_Lane, RoanKattouw, I usually send a private IRC message to woosters with th RT ticket URL and then it gets approved real fast.
[22:35:18] CT is aware of this ticket
[22:35:18] TimStarling: is it staying on nfs?
[22:35:29] * AaronSchulz is aware of Roan
[22:35:35] It's caught up in weird-ass internal politics about how Peter isn't technically in engineering
[22:36:42] in that case, we should retroactively make THREE gerrit accounts for each person who's been left out. so they can catch up.
[22:36:46] well, paravoid was telling me that he doesn't want anything to stay on NFS
[22:37:04] ideally
[22:37:19] I would have thought it would be easier to leave EasyTimeline, ExtensionDistributor and ScanSet on NFS for now
[22:37:32] I was saying that we want to move as much upload as possible to swift then see what to do about the rest
[22:37:38] since the required resources would be negligible
[22:39:25] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours
[22:39:26] TimStarling: Got a second?
[22:40:59] TimStarling: well, in any case, I'd say let's separate content that we upload vs. content that users upload; isn't timeline something that users do?
[22:41:25] (separate it because if we opt in keeping it in NFS, we might adjust our processes for replicating the content, e.g. git)
[22:41:53] timeline is the same as math, it has images generated by the parser based on user input
[22:42:09] ExtensionDistributor also needs to be written to on user requests
[22:42:12] aha
[22:42:17] ScanSet is the only one that's truly static
[22:42:32] so, if we're putting math on swift, why not put timeline too?
[22:42:52] we can, with a config change
[22:43:09] !log running heads on commons d/d7 on bast1001 as ariel from screen, data consistency check
[22:43:17] Logged the message, Master
[22:43:20] TimStarling, on the OTRS login screen there's a section called "News" which was last updated in Jan 2009, an OTRS admin told me that you need to have db access to modify this, are there instructions somewhere that others can refer to to be able to add to it if an RT/bugzilla request is opened? I ask you because you appear to be the one that patched in 3 years ago
[22:43:48] Thehelpfulone: actually it's a file, it's not in the DB
[22:44:18] Ah okay - who can edit it?
[22:44:42] any root
[22:44:58] bz needs puppetization so anyone can do it ;_;
[22:45:18] And would they know where it is?
[22:45:22] paravoid: want to schedule math/timeline windows?
[22:45:31] didn't Tim just say that we should keep timeline on NFS?
[22:45:39] wikitech.wikimedia.org says Source is in /opt/otrs
[22:45:49] I would love it not to be on nfs in the mid term
[22:45:57] he said it would be "easier" ;)
[22:46:26] true :)
[22:46:29] paravoid: the code is already done (math already uses the abstractions)
[22:46:30] well, if the code is already written then it's not easier
[22:46:39] same with timeline
[22:47:07] okay, great
[22:47:22] TimStarling: depending on cache hit rate, I guess we might want to pre-copy the files for good measure
[22:47:24] so, how does this work? are you going to do an initial sync of the content first?
[22:47:32] though there are already scripts for that
[22:47:36] could someone do a labs user creation for me? I *think* it's not done yet. not sure if i'm reading ldaplist properly
[22:47:40] svn/labs shell/git name/MW user: akshay , email akshay.leadindia at gmail dotcom
[22:47:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:47:54] math is generated from the wikitext, how does timeline work (i.e.what is it)? I ask in the case that we move to swift and then wind up having ot move back to ms7 cause too many servers crap out
[22:48:00] hopefully not but... it's possible
[22:48:00] maybe TomDaley
[22:48:12] apergos: with timeline, lol
[22:48:17] Thehelpfulone: /opt/otrs/Kernel/Output/HTML/Standard/Motd.dtl
[22:48:27] Thanks
[22:48:52] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[22:49:01] is it something that will be regenerated from wikitext if the content is not found?
[22:49:12] it's also in patches/30-wm-brand.patch
[22:49:34] AaronSchulz: the window is for switching the config, right? are you planning to do an initial sync or to regenerate everything as we go?
[22:49:36] TimStarling: do you know which revision/tag/release we're on for OTRS?
[22:49:51] apergos: it should regen, same with math
[22:49:55] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[22:49:57] ok great
[22:50:00] Thehelpfulone: heh, TimStarling just said where it was
[22:50:06] but i was curious so i found it right when he did
[22:50:09] paravoid: I don't know what the cache hit rate is
[22:50:11] you need it changed?
[22:50:22] RobH, yes but the wording is yet to be decided
[22:50:28] paravoid: I can always copy to be safe
[22:50:34] ok, well, I know where it is now, so I am happy to help
[22:50:40] those copies shouldn't take very long
[22:50:43] relatively speaking
[22:50:48] Ryan_Lane: so i have some backscroll to read (and have to reread the patch) but I think maybe I get what you want for 8120. i'll work on that and on the rebase in a couple hrs
[22:50:52] jeremyb: a heavily patched version of a 2.4 pre-release development version, see /opt/otrs/README.wikimedia
[22:50:57] great, I'll let you know when we have a message
[22:51:02] jeremyb: cool, thanks
[22:51:09] RobH: If you're in the mood for something OTRSey, RT # 3515 should only take a minute :-)
[22:51:14] Ryan_Lane: do you want to do that labs account above?
[22:51:17] they told me that my security patch was accepted, that's where most of the patching was
[22:51:28] which labs account?
[22:51:31] * 3513
[22:51:47] Ryan_Lane: 22:47:40 UTC. so 4 mins ago
[22:51:57] in this channel
[22:52:13] which user name?
[22:52:21] 04 22:47:40 < jeremyb> svn/labs shell/git name/MW user: akshay , email akshay.leadindia at gmail dotcom
[22:52:28] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[22:52:46] it's one that already has an svn account?
[22:53:07] I don't see a request for it on the developer access page
[22:53:26] TimStarling, jeremy asks because https://bugzilla.wikimedia.org/show_bug.cgi?id=22622#c24 - there are some XSS vulnerabilities pre 2.4.14
[22:53:45] jeremyb: all of the names should be akshay?
[22:53:46] Ryan_Lane: yeah. i'm not certain I'm reading ldaplist right but it's not in [[special:listusers]]
[22:53:55] Ryan_Lane: yes
[22:54:01] RD, is this about the news item update or something else (I was just going to send an email to the otrs mailing list)
[22:54:03] AaronSchulz: something like 95% misses it seems. that sounds wrong...
[22:54:07] Thehelpfulone: So in case you have to ask another root to do this, the actual MOTD file is located on the OTRS server (williams) in /opt/otrs/Kernel/Output/HTML/Standard/Motd.dtl
[22:54:15] Thehelpfulone: Nope, something different
[22:54:20] the patch tim listed references that file for content
[22:54:27] paravoid: lets like at timeline first
[22:54:41] yep I figured that out :)
[22:54:47] Ryan_Lane: also there's 2 different people named (as in birth cert) akshay!
[22:54:59] what do you mean?
[22:55:03] timeline is all misses
[22:55:26] high or low rate?
[22:55:33] *request rate
[22:55:35] 0% hit rate
[22:55:39] jeremyb: ?
[22:55:44] :-D
[22:55:51] Ryan_Lane: just they both exist. that confused me for a bit.
[22:56:05] I still don't understand what you mean
[22:56:09] paravoid: I was talking about qps
[22:56:12] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22668
[22:56:18] are you saying that this svn user may not be the same person?
[22:56:22] are there fucked up accounts?
[22:56:28] please be specific when you say stuff like that
[22:56:29] Ryan_Lane: no, i think the accounts are fine
[22:56:35] 0% hit rate but only a few gets/sec is fine too
[22:56:35] Thehelpfulone: looks pretty easy to fix
[22:56:43] so I still have no clue what you mean
[22:56:47] Ryan_Lane: just saying there's a kaldari and a lane, so there's also 2 akshays
[22:56:48] not really my job though
[22:57:01] that still makes no sense to me
[22:57:09] TimStarling: I thought everything was your job ;)
[22:57:16] TimStarling, whose job is it in engineering so I can nicely ask them? :)
[22:57:16] AaronSchulz: timeline's qps is almost negligible
[22:57:23] figures
[22:57:32] jeremyb: this is just an observation that two different people have the same last name?
[22:58:09] math is more, but still just a few qps
[22:58:17] Ryan_Lane: it's a first name i think. but sure. was just a warning because i recognized the name because i'd seen the other akshay before and i had them mixed up at first
[22:58:33] ok
[22:58:38] you could ask ops but they will take 6 months and upgrade the server to precise as a prerequisite ;)
[22:58:38] well, that request is done
[22:58:45] no clue how to tell the user that, though
[22:58:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.793 seconds
[22:58:54] paravoid: ok, sounds simple then
[22:59:01] Ryan_Lane: he's on irc, i'll tell him
[22:59:10] Ryan_Lane: i see the change in ldaplist
[22:59:15] danke
[22:59:28] TimStarling, ask RD, we've been waiting for years :P
[22:59:40] yw
[22:59:56] And I was told today it is "not on the roadmap" :P Not promising
[22:59:58] Shrug
[23:00:02] told him
[23:00:15] That's why I ask for little things like database queries in the meantime :P
[23:00:18] for upgrade to 3.0 you mean?
[23:00:43] to fix that security advisory all you have to do is change two files, they tell you which two files to change and what to change them to on http://www.otrs.com/de/open-source/community-news/security-advisories/security-advisory-2012-02/
[23:00:53] Thehelpfulone: TimStarling: actually i didn't really ask because of the bug. it was more of a "how to reproduce the same state that's currently in prod using labs" thing
[23:01:01] (or using some other staging env)
[23:01:56] paravoid: well the script is still needed, since they re-render on parse, not view
[23:02:06] AaronSchulz: right
[23:02:10] AaronSchulz: I was just thinking about that
[23:03:01] so, 1. make MW write on both 2. run the sync script 3. switch squids
[23:03:45] yeah
[23:04:37] and just to clarify, you'll sync only /math, not /wikipedia/en/math, right?
[23:04:51] oh yeah, since we were getting requests for those....
[23:05:19] paravoid: well we'd do timeline first...but yes, for math, just /math
[23:05:22] we're getting requests for both, /math is the path iirc
[23:05:35] the new*
[23:06:01] yeah we switched back to just math/ a while back...though some cached versions of pages may still refer to site/lang/math for a while...
[23:06:27] but timeline is going to stay in site/lang/timeline, correct?
[23:06:39] yep
[23:06:45] great
[23:06:51] some still do, I am looking at the live requests right now
[23:06:57] since our squid cache will just affect /math, we should be good there
[23:07:06] GET /wikipedia/es/math/...
[23:07:20] the site/lang ones would still hit nfs
[23:07:21] apergos: cached pages
[23:07:26] uh huh
[23:07:35] we'll leave ms7 up until those hits gradually vanish
[23:07:36] so how long before those pages fall out of the cache?
[23:07:52] 30 days since that change?
[23:07:52] we could even rewrite them if needed
[23:08:11] I presume the hash on the right-hand side remained the same?
[23:08:46] yes
[23:08:53] RD: that sql query doesnt work per 3513
[23:09:05] hash is a function of the mathml text (to be rendered) and options
[23:09:08] ok, well having the rewrite in our back pocket is good, because ms7 does have limited space
[23:09:34] I dont' think we'll hit the wall but if performance turns out to be an issue before then, we might have to hustle
[23:10:03] (*cough* zfs issues *cough*)
[23:10:44] RobH: Damn it! Ok, thanks
[23:10:55] there is system_address
[23:11:07] but the sql query joins system_user, which is a non-existent table
[23:11:16] there is also role_user
[23:12:12] i updated the rt with details.
[23:12:53] what prep work are we going to need? I'm thinking from the ops side, none really
[23:13:09] AaronSchulz: so, squid switch is trivial; (1) and (2) above sound like your territory, so I'd say schedule it as you please and just ping us?
[23:13:13] New patchset: Aaron Schulz; "Write timeline files to swift and NFS." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22683
[23:13:18] hah
[23:14:19] New review: Ryan Lane; "hostname needs to be changed into an array called hostnames." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/15561
[23:14:23] RoanKattouw_away: ^^
[23:18:34] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100%
[23:19:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17640
[23:21:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21895
[23:22:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22111
[23:23:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215
[23:24:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217
[23:25:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326
[23:26:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475
[23:27:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21561
[23:27:49] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22683
[23:27:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19230
[23:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:53] New review: Thehelpfulone; "So did we delete sep11 wiki back in 2003?" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/22534
[23:34:50] correct me if I mis understand, but doesn't "diff all unified" mean that gerrit should only open one diff?
[23:35:08] lol
[23:35:13] at least I thought that's what it did the last time I pressed it? :P
[23:35:34] it never has on gerrit ;)
[23:35:35] I think it's always been horribly broken
[23:36:41] well it is the "unified diff" of "all files"
[23:37:16] more broken is looking at a diff where a rebase happened in the middle
[23:37:36] so how many browser tabs would it open?
[23:41:19] see now https://codereview.qt-project.org/#patch,all,26062,9 is nice
[23:43:13] ugh, gerrit uses a 1024 bit host ssh key? how did i never notice that before?
[23:43:16] ;-(
[23:43:42] * jeremyb runs away
[23:45:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.587 seconds
[23:46:46] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (43815)
[23:47:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwikisource (25332), zhwiki (43427)
[23:55:42] AaronSchulz: so?
[23:56:04] did we agree on the plan/deploy windows?
[23:56:52] paravoid: you mean me pinging you? :)
[23:57:37] Planning? Pfft.
[23:57:48] I guess when the script is done we can do (3)
[23:59:28] Someone thinks their wiki SUL password is being cracked. What shall I tell them?
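A closing note on the math/timeline plan sketched at 23:03 (make MW write to both stores, run the sync script, switch the squids): step 1 is what AaronSchulz's merged change 22683 ("Write timeline files to swift and NFS") does on the MediaWiki side. A double-write setup of that era, using MediaWiki 1.20's FileBackendMultiWrite class, has roughly the shape below. This is an illustrative sketch, not the actual wmf-config; the backend names, paths, and credential variables are all placeholders.

```php
<?php
// Rough sketch of a double-write file backend (placeholder values only).
$wgFileBackends[] = array(
	'name'        => 'timeline-multiwrite',
	'class'       => 'FileBackendMultiWrite',
	'lockManager' => 'nullLockManager',
	'masterIndex' => 0, // NFS stays authoritative while cached hits drain off ms7
	'backends'    => array(
		array(
			'name'     => 'timeline-nfs',
			'class'    => 'FSFileBackend',
			'basePath' => '/mnt/upload6/timeline', // placeholder path
		),
		array(
			'name'         => 'timeline-swift',
			'class'        => 'SwiftFileBackend',
			'swiftAuthUrl' => $swiftAuthUrl, // placeholder credentials
			'swiftUser'    => $swiftUser,
			'swiftKey'     => $swiftKey,
		),
	),
);
```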