[00:13:40] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:31] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97057 bytes in 1.970 second response time [00:18:30] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:40] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:30] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97057 bytes in 0.888 second response time [00:24:40] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:40] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97057 bytes in 9.233 second response time [00:29:00] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:00] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 6.942 second response time [00:33:40] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:40] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97055 bytes in 4.024 second response time [00:36:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:48:37] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:49:27] RECOVERY - LVS HTTPS IPv6 on 
wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66718 bytes in 0.708 second response time [00:58:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:59:37] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:27] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 0.702 second response time [01:18:04] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:19:54] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97057 bytes in 0.782 second response time [01:23:04] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:54] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97055 bytes in 0.797 second response time [01:37:12] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:12] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 6.840 second response time [01:38:42] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:52] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [01:39:02] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:02] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 
66720 bytes in 8.396 second response time [01:40:42] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 7.540 second response time [01:43:02] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:43:22] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:22] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 9.383 second response time [01:45:12] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:12] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66720 bytes in 9.847 second response time [01:48:02] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97057 bytes in 7.505 second response time [01:53:02] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:02] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 97055 bytes in 8.883 second response time [01:54:32] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:22] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66721 bytes in 0.790 second response time [10:19:11] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [10:22:11] (03PS1) 10Faidon: Åland, Guernsey, Isle of Man and Jersey to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80970 [10:22:13] (03PS1) 10Faidon: Africa to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80971 [10:22:13] 
(03PS1) 10Faidon: Middle-East to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 [10:22:14] (03PS1) 10Faidon: Switch Central/South Asia to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80973 [10:23:47] (03CR) 10Faidon: "This includes Iran and will certainly mess with the HTTPS A/B tests as the esams IPs wouldn't be blocked yet. Cc'ing Ryan/Ori." [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 (owner: 10Faidon) [10:27:35] (03PS1) 10Faidon: Add all Asian countries in the list [operations/dns] - 10https://gerrit.wikimedia.org/r/80974 [10:28:12] mark: I think I'm gonna let you decide on whether we should switch half of the world to esams :) [10:28:35] oh c'mon [10:49:29] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:19] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:22:29] yay [11:24:27] hi paravoid :-] [11:39:08] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [12:04:49] (03PS1) 10Mark Bergsma: Initial version of PROXY support for Varnish [operations/debs/varnish] (patches/proxy-support) - 10https://gerrit.wikimedia.org/r/80982 [12:27:35] (03PS1) 10Hashar: beta: resurect arwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80983 [12:28:07] (03CR) 10Hashar: [C: 032] beta: resurect arwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80983 (owner: 10Hashar) [12:28:16] (03Merged) 10jenkins-bot: beta: resurect arwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80983 (owner: 10Hashar) [13:05:44] (03CR) 10Mark Bergsma: [C: 04-1] "(2 comments)" [operations/debs/varnish] (patches/proxy-support) - 10https://gerrit.wikimedia.org/r/80982 (owner: 10Mark Bergsma) [13:51:30] (03PS1) 10Petr Onderka: Make sure revisions of a page are sorted by their id [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/80987 [13:53:08] goood 
mornnniiing/afternoon paravoid! [13:53:49] you around? kafka debian branching q for you [13:54:06] he's on vacation [13:55:07] (CR) Petr Onderka: [C: 2 V: 2] Make sure revisions of a page are sorted by their id [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/80987 (owner: Petr Onderka) [13:58:34] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:34] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: No successful Puppet run in the last 10 hours [14:23:01] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: No successful Puppet run in the last 10 hours [14:28:01] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [14:39:01] (PS1) Hashar: points wikidata.org to lb-wikidata [operations/dns] - https://gerrit.wikimedia.org/r/80993 [14:39:53] (PS1) Ottomata: Allow Christian to sudo -u stats to debug and test stats' automated cron jobs on stat1 [operations/puppet] - https://gerrit.wikimedia.org/r/80994 [14:39:57] (CR) Hashar: "My first ever DNS change on Wikimedia infrastructure :-}" [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [14:40:29] hahaha [14:41:22] (CR) Mark Bergsma: [C: -2] "Your first ever, invalid DNS change on Wikimedia infrastructure. ;-)" [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [14:42:35] mark: can't we use CNAME ? :( [14:42:51] you can use CNAME but not there [14:42:59] you're effectively aliasing the entire domain [14:43:46] so maybe: [14:43:46] wikidata.org. IN CNAME wikidata-lb.wikimedia.org.
[14:43:53] (damn missed 1H) [14:44:26] no [14:44:35] please read a good book on DNS if you're gonna propose DNS changes [14:44:41] this is terribly invalid and would break everything [14:47:35] don't break wikidata :) [14:49:39] (CR) Mark Bergsma: [C: 1] Allow Christian to sudo -u stats to debug and test stats' automated cron jobs on stat1 [operations/puppet] - https://gerrit.wikimedia.org/r/80994 (owner: Ottomata) [14:50:09] (welcome back, btw ;-) [14:51:36] (PS2) Hashar: points wikidata.org to pmtpa wikidata lb [operations/dns] - https://gerrit.wikimedia.org/r/80993 [14:52:00] so the RFC says nothing can come along with a CNAME [14:52:14] (CR) Ottomata: [C: 2 V: 2] Allow Christian to sudo -u stats to debug and test stats' automated cron jobs on stat1 [operations/puppet] - https://gerrit.wikimedia.org/r/80994 (owner: Ottomata) [14:52:15] and since wikidata.org. already has a SOA … [14:53:18] and also every record below it [14:55:52] that would duplicate the entries once more ?
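The CNAME objection raised in the exchange above can be illustrated with a small check. This is only a sketch, not the tooling used for operations/dns: it assumes a zone is given as plain (name, type, value) tuples, and the `cname_conflicts` helper and the SOA/NS values are hypothetical. It applies the rule from RFC 1034 section 3.6.2 that a name holding a CNAME may hold no other data, which is why a CNAME at a zone apex (which always carries SOA and NS records) is invalid.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return the names where a CNAME coexists with another record type.

    `records` is an iterable of (name, rtype, value) tuples. This is a
    simplified check: it ignores the DNSSEC record types that are
    permitted to sit next to a CNAME.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        # Normalise case and the trailing dot so equal names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone data: the apex has SOA and NS, so adding a CNAME
# there conflicts, while a CNAME on a name with no other data is fine.
zone = [
    ("wikidata.org.", "SOA", "ns0.example. hostmaster.example. 1 3600 600 86400 300"),
    ("wikidata.org.", "NS", "ns0.example."),
    ("wikidata.org.", "CNAME", "wikidata-lb.wikimedia.org."),
    ("www.wikidata.org.", "CNAME", "wikidata-lb.wikimedia.org."),
]

print(cname_conflicts(zone))
```

Only the apex name is reported here; the `www` entry passes because nothing else is attached to that name.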
[14:56:10] (PS1) Aude: Enable data transclusion for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80996 [14:56:44] and we need geo dns for the www.wikidata.org entry since that is the wiki [14:56:47] (CR) Aude: [C: -1] "this should wait until after 1.22wmf14 deployment" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80996 (owner: Aude) [14:57:59] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [14:59:59] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:10:39] the education program folks in Brazil are hoping this configuration request can be done soon: https://bugzilla.wikimedia.org/show_bug.cgi?id=52870 [15:14:22] (CR) Yuvipanda: "Patch Set 1: Code-Review+2" [operations/puppet] - https://gerrit.wikimedia.org/r/80332 (owner: Tim Landscheidt) [15:15:49] PROBLEM - DPKG on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:19] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: No successful Puppet run in the last 10 hours [15:20:19] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: No successful Puppet run in the last 10 hours [15:20:19] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [15:22:19] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: No successful Puppet run in the last 10 hours [15:51:55] (PS1) QChris: Fix host for geowiki's research connection [operations/puppet] - https://gerrit.wikimedia.org/r/81004 [16:02:32] (PS2) Hashar: ORI FOR KING OF DEPLOYMENTS [operations/puppet] - https://gerrit.wikimedia.org/r/77838 (owner: Pyoungmeister) [16:03:04] (Abandoned) Hashar: ORI FOR KING OF DEPLOYMENTS [operations/puppet] - https://gerrit.wikimedia.org/r/77838 (owner: Pyoungmeister) [16:09:15] :) [16:09:22] ohh peter...
[16:29:11] !log bringing analytics1005 down for reinstall [16:29:16] Logged the message, Master [16:30:53] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:07] thanks greg-g [16:34:37] ragesoss: np [16:35:58] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:36:08] PROBLEM - NTP on analytics1005 is CRITICAL: NTP CRITICAL: No response from NTP server [16:36:08] PROBLEM - SSH on analytics1005 is CRITICAL: Connection refused [16:36:18] PROBLEM - RAID on analytics1005 is CRITICAL: Connection refused by host [16:36:38] PROBLEM - DPKG on analytics1005 is CRITICAL: Connection refused by host [16:36:48] PROBLEM - Disk space on analytics1005 is CRITICAL: Connection refused by host [16:37:25] !log taking down analytics1006, analytics1008 and analytics1009 for reinstall [16:37:30] Logged the message, Master [16:37:58] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:48] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:08] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:58] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 16%, RTA = 0.35 ms [16:43:18] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [16:43:58] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:45:08] PROBLEM - Disk space on analytics1006 is CRITICAL: Connection refused by host [16:45:08] PROBLEM - SSH on analytics1006 is CRITICAL: Connection refused [16:45:18] PROBLEM - RAID on analytics1006 is CRITICAL: Connection refused by host [16:45:18] PROBLEM - Disk space on analytics1008 is CRITICAL: Connection refused by host [16:45:38] PROBLEM - DPKG on analytics1006 is CRITICAL: Connection refused by host [16:45:38] PROBLEM - SSH on analytics1008 is CRITICAL: Connection refused [16:45:48] PROBLEM - DPKG on analytics1008 is CRITICAL: Connection refused by host [16:45:58] PROBLEM - DPKG on 
analytics1009 is CRITICAL: Connection refused by host [16:46:08] PROBLEM - Disk space on analytics1009 is CRITICAL: Connection refused by host [16:46:18] PROBLEM - RAID on analytics1008 is CRITICAL: Connection refused by host [16:46:18] PROBLEM - SSH on analytics1009 is CRITICAL: Connection refused [16:46:38] PROBLEM - RAID on analytics1009 is CRITICAL: Connection refused by host [16:51:38] (03CR) 10Demon: [C: 032] Enable secure login on mw.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80787 (owner: 10Demon) [16:51:50] (03Merged) 10jenkins-bot: Enable secure login on mw.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80787 (owner: 10Demon) [16:52:43] !log demon synchronized wmf-config/InitialiseSettings.php 'mediawikiwiki to secure login' [16:52:49] Logged the message, Master [16:53:18] (03CR) 10Demon: [C: 032] Proposed configuration for wgSecureLogin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80175 (owner: 10Tim Starling) [16:53:27] (03Merged) 10jenkins-bot: Proposed configuration for wgSecureLogin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80175 (owner: 10Tim Starling) [16:54:45] !log demon synchronized wmf-config/InitialiseSettings.php 'GeoIP configuration for secure login' [16:54:51] Logged the message, Master [16:55:36] !log demon synchronized wmf-config/CommonSettings.php 'GeoIP configuration for secure login' [16:55:41] Logged the message, Master [16:55:57] Coren: do you approve labs projects or just ryan? [16:56:28] hoo in -dev is asking about it, figured i could at least ask you ;] [16:56:29] RobH: Either/or, though I'll tend to defer to Ryan in edge cases. 
[16:56:46] his project is https://wikitech.wikimedia.org/wiki/New_Project_Request/accessibility-dev [16:57:00] just passing it along cuz i idle in there =] [16:57:28] PROBLEM - NTP on analytics1008 is CRITICAL: NTP CRITICAL: No response from NTP server [16:57:28] PROBLEM - NTP on analytics1006 is CRITICAL: NTP CRITICAL: No response from NTP server [16:58:08] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:28] PROBLEM - NTP on analytics1009 is CRITICAL: NTP CRITICAL: No response from NTP server [17:03:18] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [17:05:31] !log csteipp synchronized php-1.22wmf14/extensions/LiquidThreads 'Fix bug53320' [17:05:36] Logged the message, Master [17:06:00] !log csteipp synchronized php-1.22wmf13/extensions/LiquidThreads 'Fix bug53320' [17:06:05] Logged the message, Master [17:14:54] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:24] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:44] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:25] Hi LeslieCarr! 
[17:18:35] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [17:19:24] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [17:20:14] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:34] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [17:25:24] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [17:38:32] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:42:22] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:42:42] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:43:32] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:32] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [17:56:52] RECOVERY - Host analytics1006 is UP: PING WARNING - Packet loss = 73%, RTA = 0.37 ms [17:59:04] ottomata: hey [17:59:33] hiya [17:59:41] we communicated via email, communication: successful [17:59:51] :) [18:00:03] ottomata: actually [18:00:09] i just realized this is the projects meeting, not the real ops meeting [18:00:11] i don't need to be there [18:00:34] and my flight boards at 12:50 , shall i go into the "secured area", grab a $REFRESHINGBEVERAGE and should we do some work ? [18:00:38] or tomorrow i am free completely [18:01:12] ah, hm, how much time do you have? I am dying of hunger and was about to grab lunch [18:01:22] ah, less than a couple of hours [18:01:26] yeah 1h50m [18:01:32] minus about 15-20 for security + power finding [18:01:38] ha, yeah [18:01:54] hmm, maybe we can talk real quick about what we are about to do? [18:01:58] and actually do it tomorrow?
[18:02:03] that sounds great [18:02:18] k [18:03:32] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:03:37] ok, so is our plan to completely wipe the machines, as in reinstall and then bring up the cluster as in the order of the module documentation ? [18:05:08] yup [18:05:09] totally [18:05:11] that is the plan [18:05:14] wipe them all [18:05:25] I'd like to use analytics1009 for the standby namenode [18:05:29] that isn't part of the hadoop cluster as is [18:05:38] analytics1010 will remain the primary namenode [18:05:46] 1009 and 1010 are both ciscos with the same sepcs [18:05:48] specs [18:06:07] so yeah, actually [18:06:08] cool [18:06:08] :) [18:06:16] Leslie, I suppose today I can go ahead and just reinstall the base system on the machines [18:06:24] don't think you'll miss out much with that :) [18:06:31] just only include the standard class on all of them ? [18:06:38] ok yeah i can do that [18:06:49] these ciscos though, they are a pain [18:06:54] i'm trying to reinstall 4 of them now [18:06:56] also, because it may page for some other stuff, you may need to do the puppetstoredcleanconfig.rb on stafford [18:06:58] one of which is 1009 [18:07:06] OH! [18:07:16] does that remove the virtual nagios/icinga resources? [18:07:18] or it may page for any hadoop related pages that are set up [18:07:23] it clears out the catalog so it's reread [18:07:37] so then afterwards naggen will remove it [18:07:38] so yeah [18:07:40] yeah, totally getting pages about some stuff that we took offline today and last week (not hadoop) [18:07:45] great, i didn't realize that existed [18:07:52] :) [18:07:58] Ryan_Lane and I were talking about that on Friday I think [18:08:05] theoretically putting the machine in decom.pp fixes all… but we all know about theoretically :) [18:08:19] oh and you need to use FQDN and not just host name [18:08:20] ha, yeah, and we are re-using these nodes [18:08:22] k [18:08:32] ok, bbi15-20 ?
[18:08:52] ok, wait real quick, where is puppetstoredcleanconfig.rb? [18:09:29] 1 sec [18:09:38] (i never remember, i just ctrl+r) [18:10:09] ok, i will run and do this lunch thing, probably back on in 30someish [18:10:12] /usr/local/sbin/puppetstoredconfigclean.rb [18:10:25] danke [18:10:31] example usage -- "puppetstoredconfigclean.rb iron.wikimedia.org" [18:10:35] on stafford [18:10:36] bye [18:10:38] great [18:10:38] danke [18:10:40] ja be back soon [18:14:49] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:16:39] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:39] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:18:39] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:23:32] (03PS1) 10Demon: Fix up $wmgHTTPSBlacklistCountries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81024 [18:25:32] (03CR) 10Reedy: [C: 032] Fix up $wmgHTTPSBlacklistCountries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81024 (owner: 10Demon) [18:25:44] (03Merged) 10jenkins-bot: Fix up $wmgHTTPSBlacklistCountries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81024 (owner: 10Demon) [18:27:55] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix wmgHTTPSBlacklistCountries' [18:28:01] Logged the message, Master [18:36:30] (03CR) 10Ryan Lane: [C: 032] Use nginx module for protoproxy and disable notify [operations/puppet] - 10https://gerrit.wikimedia.org/r/80401 (owner: 10Ryan Lane) [18:40:09] LeslieCarr: still around? [18:40:26] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, closed and wikimedia to 1.22wmf14 [18:40:31] Logged the message, Master [18:41:48] hey ops, is it correct to say that if I get "administratively prohibited: open failed" when connecting to tin, then I don't have deployment rights? 
[18:49:43] !log reedy synchronized wmf-config/ [18:49:52] Logged the message, Master [18:51:00] ^d: Still seems to be giving php warnings :/ [18:51:21] <^d> ffs. [18:51:21] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:51:23] <^d> Ok. [18:51:46] It's an array when checked in eval.php [18:51:53] <^d> lol. [18:51:56] <^d> global. [18:52:37] wheee [18:52:41] manybubbles, you indeed don't have access to deployment, only to mw13[234] [18:53:42] MaxSem: thanks. I wanted to get another deploy of CirrusSearch without bothing ^d too much [18:53:47] (03PS1) 10Demon: Globals are global [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81029 [18:56:09] manybubbles: i checked, you have shells on: search20,search19,search14,... gadolinium,searchidx1001,mw131,mw132,locke [18:56:11] (03CR) 10Reedy: [C: 032] Globals are global [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81029 (owner: 10Demon) [18:56:21] (03Merged) 10jenkins-bot: Globals are global [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81029 (owner: 10Demon) [18:56:23] thanks! [18:56:28] eh yeah, a few more search boxen [18:56:30] np [18:59:06] !log reedy synchronized wmf-config/CommonSettings.php [18:59:12] Logged the message, Master [19:04:23] thanks mutante and MaxSem. [19:07:28] (03PS1) 10RobH: fixing dns for netmon1001 [operations/dns] - 10https://gerrit.wikimedia.org/r/81037 [19:08:03] !log i accidentally (yea right) pushed .gitreview to master rather than head, so no gerrit log for my change, opps! 
[19:08:07] Logged the message, RobH [19:08:19] !log for operations/dns.git [19:08:25] Logged the message, RobH [19:08:32] !log reedy synchronized database lists files: wikivoyage, wikiversity and wiktioanry to 1.22wmf14 [19:08:37] Logged the message, Master [19:09:11] <^d> Reedy: Tons of "Catchable fatal error: Object of class Wikibase\Reference could not be converted to string at /usr/local/apache/common-local/php-1.22wmf14/extensions/Wikibase/repo/includes/changeop/ChangeOpReference.php on line 163" :( [19:11:04] (03CR) 10Dzahn: [C: 032] Planet updates. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80732 (owner: 10Dereckson) [19:13:12] (03PS2) 10RobH: fixing dns for netmon1001 [operations/dns] - 10https://gerrit.wikimedia.org/r/81037 [19:13:42] (03PS1) 10Aude: add a client for test.wikidata clientDbList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81038 [19:13:48] (03CR) 10Dzahn: [C: 04-1] "not adding controversial feed per Jalexander" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80760 (owner: 10Dereckson) [19:14:10] (03PS1) 10Asher: add missing uploadlb6 ips [operations/puppet] - 10https://gerrit.wikimedia.org/r/81039 [19:14:40] (03CR) 10RobH: [C: 032 V: 032] fixing dns for netmon1001 [operations/dns] - 10https://gerrit.wikimedia.org/r/81037 (owner: 10RobH) [19:15:51] (03CR) 10Demon: [C: 031] "I'm an easy change just begging to be merged ;-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 (owner: 10Demon) [19:16:48] (03CR) 10Reedy: [C: 032] add a client for test.wikidata clientDbList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81038 (owner: 10Aude) [19:16:56] (03Merged) 10jenkins-bot: add a client for test.wikidata clientDbList [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81038 (owner: 10Aude) [19:17:11] (03CR) 10Dzahn: "is this stuff elsewhere now?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 (owner: 10Demon) [19:17:31] !log reedy synchronized wmf-config/CommonSettings.php [19:18:04] (03CR) 10Demon: "The IRC bot is a tool labs project now, no more crappy puppet stuff. All the old cruft is already gone from manganese, just cleaning out t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 (owner: 10Demon) [19:18:22] (03CR) 10Dzahn: [C: 031] "eh, i know gitweb replaced by gitblit, so +1 unless it makes us touch IRC bots :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 (owner: 10Demon) [19:18:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource, wikibooks and wikinews to 1.22wmf14 [19:18:43] Logged the message, Master [19:18:47] (03CR) 10Dzahn: [C: 032] Remove old ircbot and gitweb cruft [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 (owner: 10Demon) [19:21:06] (03PS7) 10Demon: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - 10https://gerrit.wikimedia.org/r/75499 (owner: 10Hashar) [19:21:52] (03CR) 10Dzahn: [C: 032] "go go, 4 jenkins slaves" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75499 (owner: 10Hashar) [19:23:12] (03PS7) 10Demon: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/75500 (owner: 10Hashar) [19:27:49] (03CR) 10Aude: "this is ready" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80996 (owner: 10Aude) [19:28:14] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote to 1.22wmf14 [19:28:20] Logged the message, Master [19:34:53] (03PS1) 10Reedy: Non wikipedias to 1.22wmf14 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81044 [19:35:19] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.22wmf14 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81044 (owner: 10Reedy) [19:35:28] (03Merged) 10jenkins-bot: Non wikipedias to 1.22wmf14 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81044 (owner: 10Reedy) [19:45:32] ^d: mutante: I guess you two are working to setup the replications ? :) [19:47:19] !log reedy synchronized php-1.22wmf14/extensions/DataValues/ [19:47:24] Logged the message, Master [19:47:31] (03PS7) 10Ottomata: Puppetizing Hadoop JournalNode and Standby HA NameNode [operations/puppet] - 10https://gerrit.wikimedia.org/r/76722 [19:47:46] !log reedy synchronized php-1.22wmf14/extensions/Wikibase [19:47:52] Logged the message, Master [19:54:03] (03PS8) 10Ottomata: Puppetizing Hadoop JournalNode and Standby HA NameNode [operations/puppet] - 10https://gerrit.wikimedia.org/r/76722 [19:57:05] !log taking analytics1010 down in preparation for repaving of hadoop cluster [19:57:09] Logged the message, Master [19:59:20] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:37] (03CR) 10Yurik: Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/69420 (owner: 10Dr0ptp4kt) [20:01:35] !log taking down analytics1011-analytics1020 for reinstall [20:01:40] Logged the message, Master [20:01:40] PROBLEM - DPKG on analytics1017 is CRITICAL: Timeout while attempting connection [20:01:40] PROBLEM - RAID on analytics1013 is CRITICAL: Timeout while attempting connection [20:01:40] PROBLEM - DPKG on analytics1014 is CRITICAL: Timeout while attempting connection [20:02:10] PROBLEM - DPKG on analytics1012 is CRITICAL: Timeout while attempting connection [20:02:21] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [20:02:21] PROBLEM - RAID on analytics1019 is CRITICAL: Timeout while attempting connection [20:02:21] PROBLEM - RAID on analytics1015 is CRITICAL: Timeout while attempting connection [20:02:30] PROBLEM - DPKG on analytics1015 is CRITICAL: Timeout while attempting connection [20:02:30] PROBLEM - DPKG on analytics1020 is CRITICAL: Timeout 
while attempting connection [20:03:20] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:21] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:21] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:21] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:21] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:21] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:50] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:03:50] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:03:50] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:04:00] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:30] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:06:33] PROBLEM - RAID on analytics1010 is CRITICAL: Connection refused by host [20:07:03] PROBLEM - SSH on analytics1010 is CRITICAL: Connection refused [20:07:23] PROBLEM - DPKG on analytics1010 is CRITICAL: Connection refused by host [20:07:24] PROBLEM - Disk space on analytics1010 is CRITICAL: Connection refused by host [20:08:33] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:08:33] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:08:33] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:08:33] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:08:33] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:08:34] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [20:08:38] (03PS2) 10Reedy: Enable data transclusion for wikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80996 
(owner: Aude) [20:08:54] (CR) Reedy: [C: 2] Enable data transclusion for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80996 (owner: Aude) [20:09:03] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:09:03] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:09:03] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [20:09:13] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [20:10:33] PROBLEM - SSH on analytics1016 is CRITICAL: Connection refused [20:10:33] PROBLEM - SSH on analytics1018 is CRITICAL: Connection refused [20:10:33] PROBLEM - Disk space on analytics1016 is CRITICAL: Connection refused by host [20:10:34] PROBLEM - Disk space on analytics1011 is CRITICAL: Connection refused by host [20:10:58] PROBLEM - SSH on analytics1014 is CRITICAL: Connection refused [20:11:03] PROBLEM - SSH on analytics1015 is CRITICAL: Connection refused [20:11:03] PROBLEM - Disk space on analytics1015 is CRITICAL: Connection refused by host [20:11:03] PROBLEM - DPKG on analytics1018 is CRITICAL: Connection refused by host [20:11:03] PROBLEM - Disk space on analytics1017 is CRITICAL: Connection refused by host [20:11:03] PROBLEM - SSH on analytics1013 is CRITICAL: Connection refused [20:11:04] PROBLEM - Disk space on analytics1018 is CRITICAL: Connection refused by host [20:11:04] PROBLEM - Disk space on analytics1012 is CRITICAL: Connection refused by host [20:11:05] PROBLEM - RAID on analytics1018 is CRITICAL: Connection refused by host [20:11:05] PROBLEM - Disk space on analytics1019 is CRITICAL: Connection refused by host [20:11:13] PROBLEM - SSH on analytics1019 is CRITICAL: Connection refused [20:11:13] PROBLEM - DPKG on analytics1011 is CRITICAL: Connection refused by host [20:11:13] PROBLEM - Disk space on analytics1013 is CRITICAL: Connection refused by host [20:11:13] PROBLEM - SSH on analytics1017 is CRITICAL:
Connection refused [20:11:13] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host [20:11:23] PROBLEM - SSH on analytics1012 is CRITICAL: Connection refused [20:11:23] PROBLEM - DPKG on analytics1013 is CRITICAL: Connection refused by host [20:11:23] PROBLEM - DPKG on analytics1016 is CRITICAL: Connection refused by host [20:11:23] PROBLEM - SSH on analytics1020 is CRITICAL: Connection refused [20:11:23] PROBLEM - RAID on analytics1011 is CRITICAL: Connection refused by host [20:11:24] PROBLEM - RAID on analytics1016 is CRITICAL: Connection refused by host [20:11:24] PROBLEM - Disk space on analytics1020 is CRITICAL: Connection refused by host [20:11:25] PROBLEM - Disk space on analytics1014 is CRITICAL: Connection refused by host [20:11:25] PROBLEM - RAID on analytics1017 is CRITICAL: Connection refused by host [20:11:26] PROBLEM - SSH on analytics1011 is CRITICAL: Connection refused [20:11:26] PROBLEM - RAID on analytics1012 is CRITICAL: Connection refused by host [20:11:27] PROBLEM - DPKG on analytics1019 is CRITICAL: Connection refused by host [20:11:52] (Merged) jenkins-bot: Enable data transclusion for wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80996 (owner: Aude) [20:17:31] !log reedy synchronized wmf-config/ [20:17:37] Logged the message, Master [20:18:33] PROBLEM - NTP on analytics1010 is CRITICAL: NTP CRITICAL: No response from NTP server [20:19:53] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [20:22:33] PROBLEM - NTP on analytics1011 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:03] PROBLEM - NTP on analytics1017 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:13] PROBLEM - NTP on analytics1014 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:13] PROBLEM - NTP on analytics1012 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:13] PROBLEM - NTP on analytics1019 is CRITICAL: NTP CRITICAL: No
response from NTP server [20:23:23] PROBLEM - NTP on analytics1016 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:23] PROBLEM - NTP on analytics1013 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:23] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:23] PROBLEM - NTP on analytics1018 is CRITICAL: NTP CRITICAL: No response from NTP server [20:23:33] PROBLEM - NTP on analytics1015 is CRITICAL: NTP CRITICAL: No response from NTP server [20:30:43] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [20:30:43] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [20:30:43] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:13] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:13] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:13] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:23] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:24] RECOVERY - SSH on analytics1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:31:33] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [20:31:33] RECOVERY - SSH on analytics1018 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:31:43] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:31:53] RECOVERY - SSH on analytics1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:03] RECOVERY - SSH on analytics1015 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:04] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:04] (CR) Ottomata: [C: 2 V: 2] Puppetizing Hadoop JournalNode and Standby HA NameNode [operations/puppet] - https://gerrit.wikimedia.org/r/76722 (owner: Ottomata)
[20:32:13] RECOVERY - SSH on analytics1019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:13] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:13] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:32:23] RECOVERY - SSH on analytics1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:23] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [20:32:34] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:32:34] RECOVERY - SSH on analytics1016 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:43] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [20:33:12] <^d> hashar: Yeah, gonna get that going for you finally :) [20:33:23] RECOVERY - SSH on analytics1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:33:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:34:46] (CR) Dzahn: [C: 2] "now that we merged the replication to lanthanum, merge this as well. it will not actively do it until ^demon manually kicks it off" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar) [20:36:37] (CR) Dzahn: "gallium and lanthanum both have /srv/ssd/gerrit/ which are empty for now" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar) [20:36:58] ^d: the replications you are setting up are not going to be used this week, so if that causes any trouble feel free to shoot them :) [20:38:21] (PS1) Ori.livneh: Add Icinga plug-in & NRPE check for EventLogging jobs [operations/puppet] - https://gerrit.wikimedia.org/r/81121 [20:49:24] (PS1) RobH: netmon1001 to base install for now [operations/puppet] - https://gerrit.wikimedia.org/r/81123 [20:52:12] (CR) Ori.livneh: [C: 1] "I think it's OK; we can account for it in our analyses.
It'll be interesting to compare NavigationTiming latency measurements taken before" [operations/dns] - https://gerrit.wikimedia.org/r/80972 (owner: Faidon) [20:52:45] (CR) Dzahn: "interface::add_ip6_mapped" [operations/puppet] - https://gerrit.wikimedia.org/r/81123 (owner: RobH) [20:55:28] (CR) Edenhill: [C: 1] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/78782 (owner: Faidon) [20:56:58] <^d> hashar: Replication to /srv/ssd/git on gallium is fine almost. Need some tweaking for lanthanum. [20:57:02] <^d> *almost fine [20:58:14] (PS2) RobH: netmon1001 to base install for now [operations/puppet] - https://gerrit.wikimedia.org/r/81123 [20:58:35] (CR) Ottomata: "Hm, looks good. I wonder if would be better to separate the services into different checks/alerts. You could still use the same script, " [operations/puppet] - https://gerrit.wikimedia.org/r/81121 (owner: Ori.livneh) [21:02:26] (CR) RobH: [C: 2] netmon1001 to base install for now [operations/puppet] - https://gerrit.wikimedia.org/r/81123 (owner: RobH) [21:04:07] (PS1) Dzahn: fix confusing comments about apache sites being in /files while they are actually .erb templates in /templates [operations/puppet] - https://gerrit.wikimedia.org/r/81129 [21:05:32] (PS2) Dzahn: fix confusing comments about apache sites being in /files while they are actually .erb templates in /templates [operations/puppet] - https://gerrit.wikimedia.org/r/81129 [21:08:01] Hello guys [21:08:44] we opened a bug requesting the installation of the Education Program extension on wiki pt [21:08:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=52870 [21:08:58] someone can help us with that bug ?
[21:09:12] It is not assigned yet [21:09:36] RodrigoPadula: yup :) [21:10:15] RodrigoPadula: that would be the platform engineering team which I am part of [21:11:05] RodrigoPadula: we are in a conf call right now and that bug is on our agenda ;) [21:11:13] (PS2) Ori.livneh: Add Icinga plug-in & NRPE check for EventLogging jobs [operations/puppet] - https://gerrit.wikimedia.org/r/81121 [21:11:59] !log reedy synchronized php-1.22wmf14/extensions/Wikibase [21:13:53] (CR) Ori.livneh: "> I wonder if would be better to separate the services into different checks/alerts." [operations/puppet] - https://gerrit.wikimedia.org/r/81121 (owner: Ori.livneh) [21:15:07] (CR) Dzahn: [C: -1] "Akosiaris, no, i edited the wrong file. I meant etherpad_lite.wikimedia.org.erb to change the behaviour of the etherpad_lite and redirect " [operations/puppet] - https://gerrit.wikimedia.org/r/80314 (owner: Dzahn) [21:18:02] (CR) Dzahn: "do we want to enforce it in etherpad_lite.wikimedia.org.erb?
it's already like below:" [operations/puppet] - https://gerrit.wikimedia.org/r/80314 (owner: Dzahn) [21:19:26] (CR) Dzahn: [C: 2] fix confusing comments about apache sites being in /files while they are actually .erb templates in /templates [operations/puppet] - https://gerrit.wikimedia.org/r/81129 (owner: Dzahn) [21:25:20] (CR) Asher: [C: 2 V: 2] add missing uploadlb6 ips [operations/puppet] - https://gerrit.wikimedia.org/r/81039 (owner: Asher) [21:30:06] (CR) Ottomata: [C: 2 V: 2] Add Icinga plug-in & NRPE check for EventLogging jobs [operations/puppet] - https://gerrit.wikimedia.org/r/81121 (owner: Ori.livneh) [21:30:36] (PS2) Dzahn: redirect http->https on etherpad.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/80314 [21:39:13] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [21:48:25] (PS2) Dzahn: puppetized motd of bots project of labs [operations/puppet] - https://gerrit.wikimedia.org/r/53145 (owner: Petrb) [21:51:41] (CR) Dzahn: [C: 2] puppetized motd of bots project of labs [operations/puppet] - https://gerrit.wikimedia.org/r/53145 (owner: Petrb) [22:03:52] (PS1) Dzahn: run cronjob for mw stats of wikki.com wiki at 9pm [operations/puppet] - https://gerrit.wikimedia.org/r/81137 [22:06:00] (CR) Dzahn: [C: 2] run cronjob for mw stats of wikki.com wiki at 9pm [operations/puppet] - https://gerrit.wikimedia.org/r/81137 (owner: Dzahn) [22:08:11] PROBLEM - Disk space on analytics1014 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:11] PROBLEM - DPKG on analytics1019 is CRITICAL: NRPE: Command check_dpkg not defined [22:08:11] PROBLEM - Disk space on analytics1020 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:21] PROBLEM - Disk space on analytics1019 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:21] PROBLEM - DPKG on analytics1018 is CRITICAL: NRPE: Command check_dpkg not
defined [22:08:31] PROBLEM - DPKG on analytics1011 is CRITICAL: NRPE: Command check_dpkg not defined [22:08:31] PROBLEM - Disk space on analytics1018 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:31] PROBLEM - DPKG on analytics1017 is CRITICAL: NRPE: Command check_dpkg not defined [22:08:41] PROBLEM - Disk space on analytics1011 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:41] PROBLEM - DPKG on analytics1016 is CRITICAL: NRPE: Command check_dpkg not defined [22:08:41] PROBLEM - Disk space on analytics1017 is CRITICAL: NRPE: Command check_disk_space not defined [22:08:51] PROBLEM - DPKG on analytics1009 is CRITICAL: NRPE: Command check_dpkg not defined [22:08:51] PROBLEM - Disk space on analytics1016 is CRITICAL: NRPE: Command check_disk_space not defined [22:09:01] PROBLEM - Disk space on analytics1009 is CRITICAL: NRPE: Command check_disk_space not defined [22:09:01] PROBLEM - DPKG on analytics1014 is CRITICAL: NRPE: Command check_dpkg not defined [22:09:01] PROBLEM - DPKG on analytics1020 is CRITICAL: NRPE: Command check_dpkg not defined [22:12:11] RECOVERY - Disk space on analytics1014 is OK: DISK OK [22:17:01] RECOVERY - Disk space on analytics1009 is OK: DISK OK [22:19:21] RECOVERY - Disk space on analytics1019 is OK: DISK OK [22:19:41] RECOVERY - Disk space on analytics1011 is OK: DISK OK [22:22:41] RECOVERY - Disk space on analytics1017 is OK: DISK OK [22:23:12] (CR) Ryan Lane: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/79955 (owner: Mattflaschen) [22:23:31] RECOVERY - Disk space on analytics1018 is OK: DISK OK [22:24:11] RECOVERY - Disk space on analytics1020 is OK: DISK OK [22:27:51] RECOVERY - Disk space on analytics1016 is OK: DISK OK [22:35:10] (CR) MZMcBride: "> Patch Set 3: -Code-Review" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/69420 (owner: Dr0ptp4kt) [22:36:34] (CR) MZMcBride: "Oh, this was a -2 removal, I believe, from skimming
#wikimedia-operations scrollback. Strange Gerrit behavior. Never mind." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/69420 (owner: Dr0ptp4kt) [22:39:56] PROBLEM - DPKG on analytics1005 is CRITICAL: NRPE: Command check_dpkg not defined [22:39:56] PROBLEM - Disk space on analytics1006 is CRITICAL: NRPE: Command check_disk_space not defined [22:40:06] PROBLEM - Disk space on analytics1005 is CRITICAL: NRPE: Command check_disk_space not defined [22:40:36] PROBLEM - DPKG on analytics1008 is CRITICAL: NRPE: Command check_dpkg not defined [22:40:46] PROBLEM - DPKG on mw31 is CRITICAL: Timeout while attempting connection [22:40:46] PROBLEM - DPKG on analytics1006 is CRITICAL: NRPE: Command check_dpkg not defined [22:40:46] PROBLEM - Disk space on analytics1008 is CRITICAL: NRPE: Command check_disk_space not defined [22:42:06] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:26] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [22:46:25] !log fixed labsdb cpufreq governor setting (and got a 2x speedup on a large groupby query, woops!) [22:46:30] Logged the message, Master [22:47:37] <^d> !log restarting gerrit [22:47:42] Logged the message, Master [22:48:46] RECOVERY - Disk space on analytics1008 is OK: DISK OK [22:49:32] (CR) Asher: [C: 2 V: 2] "Verified that bits isn't receiving ^/m/ requests."
[operations/puppet] - https://gerrit.wikimedia.org/r/73342 (owner: MaxSem) [22:50:50] (PS1) RobH: RT5673 virt13/14 dns update [operations/dns] - https://gerrit.wikimedia.org/r/81142 [22:54:46] (CR) Asher: [C: 2 V: 2] generic::mysql::server: exec[] -> Exec[] [operations/puppet] - https://gerrit.wikimedia.org/r/77123 (owner: Hashar) [22:55:59] (Abandoned) RobH: RT5673 virt13/14 dns update [operations/dns] - https://gerrit.wikimedia.org/r/81142 (owner: RobH) [22:56:01] (PS3) Asher: Setup metrics collection for elasticserch [operations/puppet] - https://gerrit.wikimedia.org/r/78414 (owner: Manybubbles) [22:56:08] (CR) Asher: [C: 2 V: 2] Setup metrics collection for elasticserch [operations/puppet] - https://gerrit.wikimedia.org/r/78414 (owner: Manybubbles) [22:56:48] (PS1) RobH: RT5676 & fixing other tampa virt dns [operations/dns] - https://gerrit.wikimedia.org/r/81144 [22:57:01] just got a 504 gateway time-out on https://www.wikidata.org/w/index.php?limit=50&tagfilter=&title=Special%3AContributions&contribs=user&target=Dexbot&namespace=1&tagfilter=&year=2013&month=-1 [22:57:22] (PS1) Asher: change elasticsearch collection period to every minute [operations/puppet] - https://gerrit.wikimedia.org/r/81145 [22:57:37] (CR) RobH: [C: 2 V: 2] RT5676 & fixing other tampa virt dns [operations/dns] - https://gerrit.wikimedia.org/r/81144 (owner: RobH) [22:57:38] (CR) Asher: [C: 2 V: 2] change elasticsearch collection period to every minute [operations/puppet] - https://gerrit.wikimedia.org/r/81145 (owner: Asher) [22:57:56] RECOVERY - Disk space on analytics1006 is OK: DISK OK [22:58:26] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [23:02:26] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.0004388093948 secs [23:04:16] (PS1) CSteipp: Enable SUL2 in beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81148 [23:05:06] RECOVERY - Disk space on analytics1005
is OK: DISK OK [23:07:31] (PS1) Ori.livneh: Add pystatsd module; provision on hafnium [operations/puppet] - https://gerrit.wikimedia.org/r/81149 [23:07:58] binasher: got a minute to review ^ ? [23:08:10] sure [23:08:25] cool, thanks [23:08:58] (PS1) Dzahn: make wikistats cron jobs more readable, add comments [operations/puppet] - https://gerrit.wikimedia.org/r/81150 [23:09:38] (CR) Aaron Schulz: [C: 1] Enable SUL2 in beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81148 (owner: CSteipp) [23:10:47] (PS2) Dzahn: make wikistats cron jobs more readable, add comments [operations/puppet] - https://gerrit.wikimedia.org/r/81150 [23:11:48] (CR) Ryan Lane: "If they can handle it in the analysis, this looks good to me." [operations/dns] - https://gerrit.wikimedia.org/r/80972 (owner: Faidon) [23:11:55] huh, the python-ss-statsd deb really doesn't come with an init / upstart script, weird [23:12:09] it does but it's broken [23:12:22] there are several things broken with it [23:12:46] i'll patch the package later tonight if i get the chance [23:12:50] ss stands for swiftstack [23:13:02] (CR) Asher: [C: 2 V: 2] Add pystatsd module; provision on hafnium [operations/puppet] - https://gerrit.wikimedia.org/r/81149 (owner: Ori.livneh) [23:13:11] this isn't a standard package or anything [23:13:37] paravoid: standard or not, it's broken: TypeError: __init__() got an unexpected keyword argument 'graphite_host' [23:13:42] this was a thing we have in the repo because ben put it there as part of the swift deployment more than a year ago [23:13:51] and I'm pretty sure it hasn't been updated since [23:13:54] :/ [23:13:58] it should convert the 'graphite_host' command-line argument to the 'host' kwarg for graphite, but instead it maps it to 'graphite_host' [23:14:05] I wouldn't use it as-is [23:14:06] I'll edit by hand for now [23:14:23] lololol - paravoid awakes at the mention of swiftstack :) [23:14:27] :P [23:14:50]
paravoid: hrm, not even temporarily? professor is in tampa and is going to get decom'd soon anyway, i'm basically experimenting. i configured it to listen on localhost only [23:15:15] * Aaron|home detects paravoid [23:15:22] I'm not here [23:15:40] great, 'cause I'm not listening :P [23:15:45] mark banned me from the channel today for being too active on my supposedly relaxing week [23:15:54] ori-l: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Misc::Graphite::Pystatsd] is already defined; cannot redefine at /etc/puppet/manifests/misc/graphite.pp:119 on node hafnium.wikimedia.org [23:16:01] ori-l: i need to run for now though [23:16:27] argh, i'll fix, sorry [23:16:31] paravoid: right...so what made ceph fall over last deploy? [23:16:41] sigh [23:16:52] it worked fine for over 3 hours [23:17:19] then suddenly radosgw requests were timing out and throwing 500s three times over the course of three hours or so [23:17:29] no warnings, alerts, or other issues [23:17:32] nothing in the logs [23:17:36] nothing the devs could think of [23:17:54] (apart from increasing the threadpool size, which I did before they even suggested it) [23:18:04] (PS1) Ori.livneh: Fully-qualify '::pystatsd' in misc::graphite::pystatsd [operations/puppet] - https://gerrit.wikimedia.org/r/81154 [23:18:13] and it still works as a secondary since then [23:18:16] really weird [23:18:18] I was meaning to talk to you about resyncing in that case, since it's trickier due to switching around [23:18:29] alright, who wants to merge a two-character diff? [23:18:34] not paravoid, because he's not here [23:19:12] I'm starting to get a bit exhausted by all this debugging [23:19:23] just think of the boost it'll give your commit stats on ohloh [23:19:24] Not Peter, because he's not here. [23:19:38] Elsie: :( [23:19:41] too soon. [23:19:48] Oh, did he actually leave?
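Note on the `graphite_host` error quoted at [23:13:37] and explained at [23:13:58]: the failure mode described there (a command-line option name passed straight through as a constructor keyword) can be reproduced with a minimal sketch. Everything below is hypothetical and for illustration only: `GraphiteClient`, `OPTION_TO_KWARG`, and both builder functions are assumed names, not the actual python-ss-statsd code, which is not shown in the log.

```python
class GraphiteClient:
    """Hypothetical stand-in for a Graphite client (not the real package class)."""

    def __init__(self, host="localhost", port=2003):
        self.host = host
        self.port = port


# Assumed mapping from command-line option names to constructor kwargs.
OPTION_TO_KWARG = {"graphite_host": "host", "graphite_port": "port"}


def build_client_broken(options):
    # Option names are forwarded unchanged, so 'graphite_host' reaches
    # __init__ as an unexpected keyword argument and raises TypeError.
    return GraphiteClient(**options)


def build_client_fixed(options):
    # Rename each option to the kwarg the constructor actually accepts.
    kwargs = {OPTION_TO_KWARG.get(name, name): value
              for name, value in options.items()}
    return GraphiteClient(**kwargs)


if __name__ == "__main__":
    options = {"graphite_host": "127.0.0.1", "graphite_port": 2003}
    try:
        build_client_broken(options)
    except TypeError as err:
        print("broken:", err)
    client = build_client_fixed(options)
    print("fixed:", client.host, client.port)
```

The by-hand edit mentioned at [23:14:06] would presumably amount to the same renaming, though the log does not say what was actually changed.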
[23:19:58] fortunately, at least on commons, there were no unsynced files that were updated, which is the hardest case (pick the one with the highest timestamp) [23:20:10] RIP, notpeter. [23:20:18] he didnt die. [23:20:27] Aaron|home: uhm, I'm a bit confused [23:20:29] hes at burning man though, so i guess it could happen. [23:20:31] so just running a copy of totally missing files in both directions and purgeDeletedFiles from when ceph was promoted should do it [23:20:51] I reverted just the master swithcover [23:20:53] *switchover [23:21:04] I didn't remove it from being a backend [23:21:20] (PS3) Ryan Lane: Restructure replication in preparation of moving off manganese [operations/puppet] - https://gerrit.wikimedia.org/r/80489 (owner: Demon) [23:21:20] writes are going to both backends synchronously, right? why would that be a problem? [23:21:36] well, some stuff always fails selectively, that's the only problem [23:22:01] like operations errors shown in filebackend-ops log [23:22:02] Ryan_Lane: can you merge https://gerrit.wikimedia.org/r/#/c/81154/ ? [23:22:20] in theory if writes always replicated 100% none of this would matter [23:22:27] right [23:22:57] ori-l: is pystatsd at a global level? [23:22:59] heh, the script I have for finding updates to existing files a time frame is just eval_body.php on my home dir...yeah yeah [23:23:00] (CR) Dzahn: [C: 2] Fully-qualify '::pystatsd' in misc::graphite::pystatsd [operations/puppet] - https://gerrit.wikimedia.org/r/81154 (owner: Ori.livneh) [23:23:07] Ryan_Lane: yep [23:23:08] it's not a module or anything? [23:23:10] * Aaron|home should run that on all wikis to double check [23:23:13] it's a module [23:23:32] why does it need to be fully qualified? [23:23:43] is this some new weird puppetism I don't know about? [23:23:50] 'The authenticity of host 'terbium (10.64.32.13)' can't be established.'
[23:23:52] * Aaron|home hmms [23:23:52] because the role class that is including it is called 'class misc::graphite::pystatsd' [23:24:00] so it interprets 'pystatsd' as a reference to itself [23:24:02] ugh. right [23:24:10] such a stupid puppet thing [23:24:24] mutante already merged it for you [23:24:33] i just realized [23:24:34] thanks mutante [23:24:41] ori-l: https://github.com/armon/statsite was the one I was looking at last time [23:25:09] paravoid: CollectD's most recent version (released four days ago) has a StatsD server implemented in C built-in [23:25:26] collectd is generally reliable and well-maintained and debian folks seem to like it, so maybe that's the way to go [23:25:35] but i'll check out statsite [23:25:53] bucky is another thing probably worthy to have a look at [23:25:53] np, by the way, we are supposed to use more puppet stdlib, check out ensure_packages and ensure_resource, they create but only if it doesn't exist already, so there should be fewer duplicate definition problems [23:25:59] https://github.com/cloudant/bucky [23:26:11] https://forge.puppetlabs.com/puppetlabs/stdlib [23:26:19] [16:19] paravoid I'm starting to get a bit exhausted by all this debugging [23:26:25] paravoid: in general or right now?
[23:26:30] in general, with ceph [23:26:49] mutante: yeah, i don't love the stdlib :P [23:27:13] because it's not "really" standard (i.e., not built-in) but masquerades as such [23:27:20] I'm one bug short of 50 [23:27:32] bug report I mean [23:27:37] that's kind of a lot [23:28:24] ori-l: so bucky is being used by dreamhost, supposedly [23:28:36] it's your ceph golden jubilee [23:29:21] ori-l: ok, but we have ./modules/stdlib/ [23:29:22] ori-l: if you promise to help with the statsd deployment I promise I'll package properly whichever of the daemons is best suited for us [23:29:38] (pystatsd/statsite/bucky/something else) [23:29:42] paravoid: deal [23:29:47] as long as it's not the node.js one :P [23:29:51] i'm pretty agnostic actually [23:30:05] (CR) Ryan Lane: [C: 2] Restructure replication in preparation of moving off manganese [operations/puppet] - https://gerrit.wikimedia.org/r/80489 (owner: Demon) [23:30:10] it listens on a socket and writes to graphite, the rest can be a black box as far as i'm concerned [23:30:11] ^d: ^^ [23:30:40] mutante: well, you're probably right [23:30:49] (CR) Dzahn: [C: 2] make wikistats cron jobs more readable, add comments [operations/puppet] - https://gerrit.wikimedia.org/r/81150 (owner: Dzahn) [23:31:00] paravoid: "Don't Stop Believin'" ;) [23:31:06] i'll check out ensure_{package,source} [23:32:02] <^d> Ryan_Lane: Ok, I'm finishing up cloning everything to lanthanum, then we'll do this fun :) [23:32:19] heh [23:32:32] <^d> Actually, you'll have to run puppet on ytterbium the first time.
I still have no account :) [23:35:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.051 second response time [23:43:15] err: Failed to apply catalog: Could not find dependency Group[500] for User[demon] at /etc/puppet/manifests/admins.pp:37 [23:48:59] <^d> :\ [23:50:08] office network issues [23:50:49] (PS1) Demon: Remove deprecated ircbot param [operations/puppet] - https://gerrit.wikimedia.org/r/81158 [23:50:49] o rly? [23:51:16] wired network disconnected [23:51:54] <^d> mutante: Where'd you see that dependency error? [23:53:15] ^d: ytterbium, fixing it [23:53:36] you have a sudo user but no account [23:53:44] (CR) Demon: [C: 1] "I187e39e0 should've been amended before submitting to take I134fa740 into account." [operations/puppet] - https://gerrit.wikimedia.org/r/81158 (owner: Demon) [23:53:55] <^d> ^ That fixes puppet on manganese. [23:54:39] ah, path conflict [23:55:29] makes sense @ircbot [23:56:03] (CR) Dzahn: [C: 2] Remove deprecated ircbot param [operations/puppet] - https://gerrit.wikimedia.org/r/81158 (owner: Demon)