[00:00:05] do I even need to sync that?
[00:00:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:05:05] bd808|BUFFER: 'Started update apaches' sounds a bit funny
[00:15:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 469866 bytes in 9.717 second response time
[00:18:36] !log ori Finished scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1 and 1.23wmf22 (duration: 32m 53s)
[00:18:43] Logged the message, Master
[00:18:54] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[00:20:00] (PS3) Ori.livneh: Enable GlobalCssJs on testwiki & test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127178 (owner: Legoktm)
[00:20:02] (CR) Ori.livneh: [C: 2] Enable GlobalCssJs on testwiki & test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127178 (owner: Legoktm)
[00:20:04] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[00:20:11] (Merged) jenkins-bot: Enable GlobalCssJs on testwiki & test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127178 (owner: Legoktm)
[00:20:52] !log ori updated /a/common to {{Gerrit|Ie9b265be9}}: Enable GlobalCssJs on testwiki & test2wiki
[00:20:58] Logged the message, Master
[00:21:20] !log ori synchronized wmf-config/InitialiseSettings.php 'Ie9b265be9: Enable GlobalCssJs on testwiki & test2wiki (1/2)'
[00:21:26] Logged the message, Master
[00:21:37] !log ori synchronized wmf-config/CommonSettings.php 'Ie9b265be9: Enable GlobalCssJs on testwiki & test2wiki (2/2)'
[00:21:43] Logged the message, Master
[00:23:06] mwalker|away: I'm not sure what needs to be done to deploy Ib984e9820, so I'm skipping it, sorry.
[00:23:37]
[00:27:54] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1433.133301
[00:28:05] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1877.800049
[00:28:54] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[00:29:05] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[00:39:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 471791 bytes in 9.731 second response time
[00:53:10] ori, *nods* I was distracted by gwicke :p
[00:53:14] I'll deploy it monday
[00:53:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:56:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 471127 bytes in 9.782 second response time
[00:59:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:00:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 471123 bytes in 9.885 second response time
[01:05:04] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC
[01:10:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:44] (CR) Jeremyb: "(in reply to Dzahn 04-15 15:49)" [operations/puppet] - https://gerrit.wikimedia.org/r/111386 (owner: Jeremyb)
[01:12:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 470425 bytes in 9.847 second response time
[01:15:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:54] RECOVERY -
gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 470425 bytes in 9.792 second response time
[01:46:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:47:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 467693 bytes in 9.595 second response time
[01:56:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:02:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 465980 bytes in 9.826 second response time
[02:11:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:12:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 463858 bytes in 9.753 second response time
[02:13:04] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3082 MB (3% inode=99%):
[02:18:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:04] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3748 MB (3% inode=99%):
[02:19:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 463855 bytes in 9.627 second response time
[02:30:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 463857 bytes in 9.781 second response time
[02:39:53] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-18 02:39:51+00:00
[02:40:01] Logged the message, Master
[03:01:04] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:06:08] !log LocalisationUpdate completed (1.24wmf1) at 2014-04-18 03:06:06+00:00
[03:06:15] Logged the message, Master
[03:24:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:30:04] RECOVERY -
gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454319 bytes in 9.749 second response time
[03:37:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 453369 bytes in 9.883 second response time
[03:51:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:53:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 452141 bytes in 9.688 second response time
[03:56:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:59:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 452140 bytes in 9.603 second response time
[04:04:04] mwalker: I am told that we have the correct puppet/ruby packages for trusty already in the repo so you should be able to spin up a labs instance on it without hassle
[04:04:26] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 18 04:04:21 UTC 2014 (duration 4m 20s)
[04:04:32] Logged the message, Master
[04:06:04] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC
[04:12:40] (CR) MaxSem: "Note that $wgMFRemovableClasses doesn't control extracts anymore, so this change needs to adappt to post https://gerrit.wikimedia.org/r/12" (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126226 (owner: Prtksxna)
[04:13:39] apergos, do you know when those became available?
[04:13:59] I'm just trying to figure out why Ryan would have been unable to resolve the conflicts he was seeing when he tried
[04:14:25] I think he wasn't relying on those packages
[04:19:09] on copper (running trusty) I see puppet 2.7.11 and ruby 1.8 which is consistent with our precise setup
[04:19:51] so it should 'just work'
[04:28:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:29:05] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454558 bytes in 9.275 second response time
[04:43:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:44:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454736 bytes in 9.869 second response time
[04:49:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:50:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454729 bytes in 9.658 second response time
[04:58:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:02:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454730 bytes in 9.718 second response time
[05:05:49] a typical retreival of the index page for gitblit (dynamically generated) takes between 9 and 12 seconds now it seems, the check_http(s) cutoff is 10
[05:06:32] having it retrive something a little lighter weight would be nice, if there is a good option
[05:06:35] * apergos pokes around
[05:08:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 454748 bytes in 9.823 second response time
[05:13:09] (PS1) ArielGlenn: change gitblit url check to retrieve something lightweight [operations/puppet] -
https://gerrit.wikimedia.org/r/127204
[05:15:30] (CR) Dzahn: [C: 1] change gitblit url check to retrieve something lightweight [operations/puppet] - https://gerrit.wikimedia.org/r/127204 (owner: ArielGlenn)
[05:16:14] (CR) ArielGlenn: [C: 2] change gitblit url check to retrieve something lightweight [operations/puppet] - https://gerrit.wikimedia.org/r/127204 (owner: ArielGlenn)
[05:17:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:32] hush you
[05:23:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 456057 bytes in 9.577 second response time
[05:26:11] "GET /tree/mediawiki%2Fcore.git HTTP/1.1" 200 58374 T=0s
[05:26:26] that's more like it
[05:26:45] cool, yep
[05:50:24] (CR) Chad: "We could go even more lightweight than mw/core. How about operations/puppet?" [operations/puppet] - https://gerrit.wikimedia.org/r/127204 (owner: ArielGlenn)
[06:01:32] ^demon|away: it already takes 0s, but change if you like :-)
[06:08:54] (CR) Dzahn: [C: 2] remove admins::restricted from lucene role [operations/puppet] - https://gerrit.wikimedia.org/r/126939 (owner: Dzahn)
[06:29:34] (PS2) Chad: New wikis done building [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126806
[06:29:40] (CR) Chad: [C: 2] New wikis done building [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126806 (owner: Chad)
[06:29:48] (Merged) jenkins-bot: New wikis done building [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126806 (owner: Chad)
[06:31:06] !log demon synchronized wmf-config/InitialiseSettings.php 'Next round of wikis done building Cirrus indexes, throw into beta mode'
[06:31:12] Logged the message, Master
[06:33:18] springle: removing m1-master.pmtpa.wmnet is also harmless, right..
it pointed to db35 which is now down
[06:33:59] there are still s1-secondary, s5-secondary, m2-secondary
[06:34:12] all talking about the DNS entries
[06:34:54] those are db63,db73,db48
[06:36:20] (PS2) ArielGlenn: add wiktionary.eu, link to wiktionary.org [operations/dns] - https://gerrit.wikimedia.org/r/126932 (owner: Dzahn)
[06:37:08] (CR) ArielGlenn: [C: 2] add wiktionary.eu, link to wiktionary.org [operations/dns] - https://gerrit.wikimedia.org/r/126932 (owner: Dzahn)
[06:37:19] :)
[06:38:38] (CR) Aklapper: [C: 1] "My guts say Yes, but any reference for "Mozilla recommends it"?" [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[06:39:34] (CR) Dzahn: "Aklapper, reference: https://wiki.mozilla.org/Security/Server_Side_TLS" [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[06:40:46] (CR) Dzahn: "well, what they do in the "Apache" section there." [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[06:41:45] (CR) Dzahn: "or even more strict as in https://www.insecure.ws/2013/10/11/ssltls-configuration-for-apache-mod_ssl/" [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[06:42:13] (CR) Dzahn: "If you do not enable RC4 or 3DES (“old” clients may not be able to connect!):" [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[06:48:16] (CR) ArielGlenn: [C: 2] Redirect wiktionary.eu to www.wiktionary.org [operations/apache-config] - https://gerrit.wikimedia.org/r/126937 (owner: Odder)
[06:55:11] mutante: yes, we can remove those pmtpa dns cnames
[06:56:58] (CR) Springle: [C: 1] decom, remove db35,db38 [operations/dns] - https://gerrit.wikimedia.org/r/126972 (owner: Dzahn)
[06:57:19] (CR) Dzahn: [C: 2] decom, remove db35,db38 [operations/dns] - https://gerrit.wikimedia.org/r/126972 (owner: Dzahn)
[06:57:49] springle: thx, done
[06:59:43] (CR) Dzahn: [C:
2] remove rendering.pmtpa,rendering.svc.pmtpa [operations/dns] - https://gerrit.wikimedia.org/r/126971 (owner: Dzahn)
[07:06:41] PROBLEM - Puppet freshness on db1056 is CRITICAL: Last successful Puppet run was Wed 16 Apr 2014 06:54:47 AM UTC
[07:07:33] (PS3) Springle: Remove mysql client from bastionhost [operations/puppet] - https://gerrit.wikimedia.org/r/126027 (owner: Hoo man)
[07:16:40] (CR) Dzahn: "works. testing 2 urls on 190 servers, totalling 380 requests" [operations/apache-config] - https://gerrit.wikimedia.org/r/126937 (owner: Odder)
[07:20:04] (PS2) Dzahn: remove arptest [operations/dns] - https://gerrit.wikimedia.org/r/125950
[07:20:59] (PS3) Dzahn: remove arptest [operations/dns] - https://gerrit.wikimedia.org/r/125950
[07:24:33] (PS1) Dzahn: remove api.svc.pmtpa.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/127209
[07:24:36] (CR) jenkins-bot: [V: -1] remove api.svc.pmtpa.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/127209 (owner: Dzahn)
[07:25:03] (PS2) Dzahn: remove api.svc.pmtpa.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/127209
[07:32:19] (PS1) Dzahn: remove Tampa appserver mgmt [operations/dns] - https://gerrit.wikimedia.org/r/127210
[07:37:39] (CR) ArielGlenn: [C: 1] remove api.svc.pmtpa.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/127209 (owner: Dzahn)
[07:37:55] (PS2) Dzahn: remove Tampa appserver reverse DNS and mgmt [operations/dns] - https://gerrit.wikimedia.org/r/127210
[07:43:24] (CR) Dzahn: [C: 2] remove api.svc.pmtpa.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/127209 (owner: Dzahn)
[07:56:21] (CR) Dzahn: [C: 2] remove arptest [operations/dns] - https://gerrit.wikimedia.org/r/125950 (owner: Dzahn)
[07:57:25] !log DNS update - remove api.svc, arptest.pmtpa ..
[07:57:31] Logged the message, Master
[07:58:37] (PS3) ArielGlenn: add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - https://gerrit.wikimedia.org/r/125954
[08:00:10] (CR) ArielGlenn: [C: 2] add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - https://gerrit.wikimedia.org/r/125954 (owner: ArielGlenn)
[08:06:23] (CR) Springle: [C: 1] "I don't disagree with this since I either tunnel or use mysql directly on the db boxes themselves. However couple notes:" [operations/puppet] - https://gerrit.wikimedia.org/r/126027 (owner: Hoo man)
[08:11:12] PROBLEM - NTP peers on dobson is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[08:13:12] RECOVERY - NTP peers on dobson is OK: NTP OK: Offset 0.000135 secs
[08:20:38] (PS2) Dzahn: Add ttf-kochi-mincho and ttf-kochi-gothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/126729 (owner: Reedy)
[08:21:01] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[08:23:01] RECOVERY - NTP peers on linne is OK: NTP OK: Offset -0.005693 secs
[08:25:23] (CR) Dzahn: [C: 2] Add ttf-kochi-mincho and ttf-kochi-gothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/126729 (owner: Reedy)
[08:27:06] (CR) Dzahn: "notice: /Stage[main]/Imagescaler::Packages::Fonts/Package[ttf-kochi-gothic]/ensure: ensure changed 'purged' to 'latest'" [operations/puppet] - https://gerrit.wikimedia.org/r/126729 (owner: Reedy)
[08:30:16] Reedy:
[08:30:19] root@mw1153:~# fc-match 'Kochi Micho'
[08:30:19] DejaVuSans.ttf: "DejaVu Sans" "Book"
[08:30:32] fc-match 'Kochi Gothic'
[08:30:32] kochi-gothic-subst.ttf: "Kochi Gothic" "Regular"
[08:31:20] ah, "micho" != "mincho"
[08:31:27] fc-match 'Kochi Mincho'
[08:31:27] kochi-mincho-subst.ttf: "Kochi Mincho" "Regular"
[08:34:52] (CR) Dzahn: "this would be good to have for "#5148: move Torrus away from manutius" one remaining
Tampa blocker" [operations/puppet] - https://gerrit.wikimedia.org/r/108498 (owner: Matanya)
[08:37:33] (PS1) Springle: MHA site-switch templates are broken with less than two available DCs, and technically useless in this situation anyway. Disable them until we get a replacement for PMTPA that could actually handle a switch over. [operations/puppet] - https://gerrit.wikimedia.org/r/127212
[08:43:42] (CR) Dzahn: "that file has been renamed by Coren in Change-Id: If985e506d5b1" [operations/puppet] - https://gerrit.wikimedia.org/r/106907 (owner: Stwalkerster)
[08:45:01] (PS1) Hashar: contint: apply beta natfix on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/127213
[08:46:33] (CR) Dzahn: "the line is still " 88 proxy_set_header X-Forwarded-For $remote_addr;" though. just now it's in domainproxy.conf" [operations/puppet] - https://gerrit.wikimedia.org/r/106907 (owner: Stwalkerster)
[08:47:49] (Abandoned) Dzahn: nrpe: enable on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/107424 (owner: Gage)
[08:51:35] (CR) Hashar: [C: 1 V: 2] "Applied on contint puppetmaster.
Both slaves are still reachable from gallium and they now manage to contact the beta cluster entries suc" [operations/puppet] - https://gerrit.wikimedia.org/r/127213 (owner: Hashar)
[08:52:08] (CR) Dzahn: [C: -2] "matanya, i think this can be abandoned then" [operations/puppet] - https://gerrit.wikimedia.org/r/119488 (owner: Matanya)
[08:53:44] (CR) Dzahn: [C: -2] "taking the liberty to abandon this, because Peter wrote it, Mark voted it down and Gabriel said he doesn't need it anymore" [operations/puppet] - https://gerrit.wikimedia.org/r/72653 (owner: Pyoungmeister)
[08:54:00] (Abandoned) Dzahn: proposal for allowing gabriel sudo access for varnishadm for parsoid caches [operations/puppet] - https://gerrit.wikimedia.org/r/72653 (owner: Pyoungmeister)
[08:59:58] (PS1) ArielGlenn: Revert "add ntp servers on eeden.esams, rubidium (rt #7101)" [operations/puppet] - https://gerrit.wikimedia.org/r/127214
[09:00:15] (PS2) ArielGlenn: Revert "add ntp servers on eeden.esams, rubidium (rt #7101)" [operations/puppet] - https://gerrit.wikimedia.org/r/127214
[09:03:43] (CR) Dzahn: [C: 1] toollabs: Add expect to exec nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125201 (owner: Yuvipanda)
[09:07:10] (CR) ArielGlenn: [C: 2] Revert "add ntp servers on eeden.esams, rubidium (rt #7101)" [operations/puppet] - https://gerrit.wikimedia.org/r/127214 (owner: ArielGlenn)
[09:08:04] (CR) Dzahn: [C: -1] "can we use networks from class network::constants here instead of listing networks?"
[operations/puppet] - https://gerrit.wikimedia.org/r/117674 (owner: Matanya)
[09:09:14] (CR) Dzahn: [C: 1] Tools: Install package libxml2-utils for xmllint [operations/puppet] - https://gerrit.wikimedia.org/r/120187 (owner: Tim Landscheidt)
[09:10:13] PROBLEM - NTP peers on dobson is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[09:12:03] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[09:12:54] they'll be back shortly
[09:13:13] RECOVERY - NTP peers on dobson is OK: NTP OK: Offset -0.000587 secs
[09:14:03] RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.003985 secs
[09:15:24] (CR) Dzahn: "please fix the path conflict" [operations/puppet] - https://gerrit.wikimedia.org/r/119438 (owner: Tim Landscheidt)
[09:17:03] (CR) Dzahn: [C: -2] cache: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111787 (owner: Matanya)
[09:19:48] (CR) Dzahn: [C: 1] Describe Math related packages in a class [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (owner: Hashar)
[09:21:21] (PS2) Springle: MHA site-switch templates are broken with less than two available DCs, and technically useless in this situation anyway. Disable them until we get a replacement for PMTPA that could actually handle a switch over. [operations/puppet] - https://gerrit.wikimedia.org/r/127212
[09:23:30] (CR) Springle: [C: 2] MHA site-switch templates are broken with less than two available DCs, and technically useless in this situation anyway. Disable them until [operations/puppet] - https://gerrit.wikimedia.org/r/127212 (owner: Springle)
[09:25:03] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Fri Apr 18 09:24:57 UTC 2014
[09:30:28] (CR) ArielGlenn: "If these are node scope aren't they covered?
See" [operations/puppet] - https://gerrit.wikimedia.org/r/111787 (owner: Matanya)
[09:34:58] (CR) Aklapper: [C: 1] bugzilla, use better SSL cipher suite [operations/puppet] - https://gerrit.wikimedia.org/r/126205 (owner: Dzahn)
[09:50:18] (PS1) Dzahn: add new Tech News atom feed to Planet [operations/puppet] - https://gerrit.wikimedia.org/r/127222
[10:01:15] (CR) Nemo bis: [C: 1] "Soon in your language!" [operations/puppet] - https://gerrit.wikimedia.org/r/127222 (owner: Dzahn)
[10:06:22] !log Upgrading Jenkins to latest LTS version 1.532.3
[10:06:29] Logged the message, Master
[10:10:05] !log Jenkins upgraded to 1.532.3.
[10:10:11] Logged the message, Master
[10:10:15] apergos: only 4 minutes \O/ Thank you very much.
[10:10:26] yw
[10:24:48] (CR) Matanya: Pass puppet-lint on realm.pp (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/127138 (owner: Hashar)
[10:28:03] matanya: what do you mean by nesting at https://gerrit.wikimedia.org/r/#/c/127138/1/manifests/realm.pp ?
[10:28:09] (Abandoned) Matanya: openstack: qualify var [operations/puppet] - https://gerrit.wikimedia.org/r/119488 (owner: Matanya)
[10:28:12] hi hashar
[10:28:18] oh and hi :-]
[10:28:24] on a very broken connection
[10:29:08] שָׁלוֹם
[10:29:31] my hebrew is as good as copy pasting from https://en.wikipedia.org/wiki/Jewish_greetings
[10:35:12] sorry hashar did you see my reply ?
[10:35:29] matanya: nop
[10:35:44] https://dpaste.de/3Kwz
[10:35:51] ack
[10:36:02] more readable i think
[10:36:27] ohh
[10:37:20] (CR) Hashar: Pass puppet-lint on realm.pp (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/127138 (owner: Hashar)
[10:37:38] matanya: thanks :]
[10:37:42] (PS2) Hashar: Pass puppet-lint on realm.pp [operations/puppet] - https://gerrit.wikimedia.org/r/127138
[11:02:00] (CR) Matanya: puppet-lint role/nova.pp (16 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/127147 (owner: Hashar)
[11:05:06] (CR) Odder: [C: 1] add new Tech News atom feed to Planet [operations/puppet] - https://gerrit.wikimedia.org/r/127222 (owner: Dzahn)
[11:06:47] (CR) Dzahn: [C: 2] add new Tech News atom feed to Planet [operations/puppet] - https://gerrit.wikimedia.org/r/127222 (owner: Dzahn)
[11:35:03] (PS1) Hashar: zuul: compress log daily [operations/puppet] - https://gerrit.wikimedia.org/r/127230
[11:37:45] (PS3) ArielGlenn: contint: gives access to Bryan Davis [operations/puppet] - https://gerrit.wikimedia.org/r/126155 (owner: Hashar)
[11:39:33] (CR) ArielGlenn: [C: 2] contint: gives access to Bryan Davis [operations/puppet] - https://gerrit.wikimedia.org/r/126155 (owner: Hashar)
[11:40:35] (PS2) Hashar: contint: compress Jenkins console logs once per day [operations/puppet] - https://gerrit.wikimedia.org/r/125991
[11:41:12] (CR) Hashar: contint: compress Jenkins console logs once per day (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/125991 (owner: Hashar)
[11:45:22] (CR) JanZerebecki: [C: 1] "Refusing users that only have support for less secure protocols (like max.
SSL3 for IE6 on Windows XP) can still be done in an additional " [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[11:48:56] !log removing mw-jenkinsbot (the wikimedia jenkins installation) from #wikimedia-labs
[11:49:02] Logged the message, Master
[12:11:56] (CR) Hoo man: "> Removing the mysql client, given it's merely a utility and not a service, won't really affect security, traffic, or load. Just saying." [operations/puppet] - https://gerrit.wikimedia.org/r/126027 (owner: Hoo man)
[12:26:41] (CR) Hoo man: [C: 1] "One step at a time :) We maybe also need a new group which is like the old restricted (but below mortals)." [operations/puppet] - https://gerrit.wikimedia.org/r/126941 (owner: Dzahn)
[12:31:18] (CR) Dzahn: [C: 2] bugzilla, use better SSL cipher suite [operations/puppet] - https://gerrit.wikimedia.org/r/126205 (owner: Dzahn)
[12:32:27] (PS2) Dzahn: bugzilla, use SSLProtocol ALL -SSLv2 [operations/puppet] - https://gerrit.wikimedia.org/r/126206
[12:34:27] (CR) Dzahn: [C: 2] bugzilla, use SSLProtocol ALL -SSLv2 [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[12:35:38] (CR) Hoo man: [C: 1] remove sudo::appserver from bastions [operations/puppet] - https://gerrit.wikimedia.org/r/126014 (owner: Dzahn)
[12:41:21] (CR) Dzahn: "TLS 1.2 Yes" [operations/puppet] - https://gerrit.wikimedia.org/r/126206 (owner: Dzahn)
[12:42:32] (CR) Dzahn: "no more RC4" [operations/puppet] - https://gerrit.wikimedia.org/r/126205 (owner: Dzahn)
[12:43:50] (CR) Springle: "> Well, it encourages users to misuse bastions, which *can* be quite risky if someone gains access (eg.
publicly readable .my.cnf, passwor" [operations/puppet] - https://gerrit.wikimedia.org/r/126027 (owner: Hoo man)
[12:44:26] (PS1) ArielGlenn: turn off rsyncs to/from dataset2, prep for 12th floor move [operations/puppet] - https://gerrit.wikimedia.org/r/127235
[13:06:56] !log Bugzilla Apache, changed SSL cipher suite in I7e9adc182dc ,might cost a a few % performance but zirconium had plenty
[13:07:02] Logged the message, Master
[13:13:36] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127236
[13:14:11] (CR) Hashar: "recheck" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127236 (owner: Hashar)
[13:14:37] (CR) Hashar: "recheck" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127236 (owner: Hashar)
[13:15:49] (CR) Hashar: "recheck" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127236 (owner: Hashar)
[13:16:04] (CR) Hashar: "recheck" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127236 (owner: Hashar)
[13:16:08] (PS1) Dzahn: remove all Tampa ms-be swift boxes from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/127237
[13:17:41] paravoid: ^ are they going to be reinstalled?
[13:18:09] ms-be
[13:24:32] (PS1) Dzahn: remove ms-be-1-12 from DHCP, netboot [operations/puppet] - https://gerrit.wikimedia.org/r/127239
[13:25:41] they need wiping
[13:25:41] (PS2) Dzahn: remove all Tampa ms-be swift boxes from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/127237
[13:25:55] non-destructive wiping that is
[13:26:21] (CR) coren: [C: 2] "We need legal review only in cases where the exposed information contains or potentially contains non-public information.
Inspection of t" [operations/software] - https://gerrit.wikimedia.org/r/118582 (owner: Aude)
[13:27:16] haha
[13:27:21] better specify now
[13:28:40] paravoid: ok, they dont need DHCP for that, was just wondering about netboot, partman recipe and stuff
[13:28:56] correc
[13:28:59] correct
[13:29:26] (PS2) Prtksxna: TextExtracts: Add classes and elements to the exclusion list [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126226
[13:30:05] cmjohnson1 asked me to do the decom asap.. are you still copying files?
[13:30:16] no
[13:30:16] should i simply shutdown as well
[13:30:19] ok
[13:30:27] I'm copying files from eqiad to esams
[13:30:31] never used tampa for that
[13:30:36] k
[13:31:30] (CR) Dzahn: [C: 2] remove all Tampa ms-be swift boxes from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/127237 (owner: Dzahn)
[13:33:09] (CR) Prtksxna: "I've adapted the changes to https://gerrit.wikimedia.org/r/127170 and added comments." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126226 (owner: Prtksxna)
[13:33:15] (PS2) Dzahn: remove ms-be-1-12 from DHCP, netboot [operations/puppet] - https://gerrit.wikimedia.org/r/127239
[13:33:20] (CR) Prtksxna: TextExtracts: Add classes and elements to the exclusion list (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126226 (owner: Prtksxna)
[13:33:40] mutante:
[13:33:42] also ms-fe
[13:34:12] paravoid: ok
[13:34:14] and also, ms-be.cfg & ms-be-ssd.cfg shouldn't be needed anymore
[13:34:24] thanks
[13:34:27] so ditch those two
[13:34:39] off now, ttyl
[13:34:46] cya
[13:54:23] !log ms-be1-12 - removing from puppet,salt,icinga
[13:54:29] Logged the message, Master
[13:58:57] PROBLEM - Host ps1-a3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:07] PROBLEM - Host ps1-a5-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:27] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:35]
(CR) Dzahn: [C: 2] remove ms-be-1-12 from DHCP, netboot [operations/puppet] - https://gerrit.wikimedia.org/r/127239 (owner: Dzahn)
[14:12:27] (PS1) Dzahn: remove ms-fe[14] from DHCP,remove partman recipes [operations/puppet] - https://gerrit.wikimedia.org/r/127244
[14:12:44] (CR) jenkins-bot: [V: -1] remove ms-fe[14] from DHCP,remove partman recipes [operations/puppet] - https://gerrit.wikimedia.org/r/127244 (owner: Dzahn)
[14:13:06] (PS2) Dzahn: remove ms-fe[14] from DHCP,remove partman recipes [operations/puppet] - https://gerrit.wikimedia.org/r/127244
[14:16:21] (CR) Dzahn: [C: 2] remove ms-fe[14] from DHCP,remove partman recipes [operations/puppet] - https://gerrit.wikimedia.org/r/127244 (owner: Dzahn)
[14:19:15] (PS1) Dzahn: remove ms-fe[14] from puppet, decom [operations/puppet] - https://gerrit.wikimedia.org/r/127245
[14:20:10] (CR) Dzahn: [C: 2] remove ms-fe[14] from puppet, decom [operations/puppet] - https://gerrit.wikimedia.org/r/127245 (owner: Dzahn)
[14:24:17] !log ms-fe[14] - stop puppet,revoke certs,remove icinga
[14:24:22] Logged the message, Master
[14:25:11] (PS1) ArielGlenn: move the puppet snmptrap into a class so it can be run in last stage [operations/puppet] - https://gerrit.wikimedia.org/r/127246
[14:27:05] manybubbles: mornin!
[14:27:27] ottomata: morning!
[14:27:31] time for 1007 & 1008
[14:27:32] ?
[14:31:25] PROBLEM - Host es6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:34] PROBLEM - Host es5 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:24] (PS1) Matanya: swift: remove swift role from tampa [operations/puppet] - https://gerrit.wikimedia.org/r/127247
[14:33:40] (CR) jenkins-bot: [V: -1] swift: remove swift role from tampa [operations/puppet] - https://gerrit.wikimedia.org/r/127247 (owner: Matanya)
[14:36:02] manybubbles: 1007 & 1008? shall I start?
[14:36:11] sure!
[14:36:15] was just in a meeting but donw now [14:36:18] done now [14:36:19] ah ok, moving shards off [14:37:22] !log ms-be 1-12, Tampa Swift boxes, shutdown [14:37:27] Logged the message, Master [14:38:04] grr..es5 and es6 are throwing icinga msgs ...should've have been decom'd [14:41:40] !log disabling puppet on stat1 for decom [14:41:46] Logged the message, Master [14:43:13] !log ms-fe[14] - shutting down [14:43:19] Logged the message, Master [14:46:57] (03PS1) 10Ottomata: Removing references to stat1, adding stat1 to decomissioning.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127250 [14:47:26] mutante: you saw my reply on ssl chain? [14:48:00] jeremyb: no [14:48:28] ottomata: while I have you, elastic1001 is reporting down in ganglia [14:48:39] mutante: https://gerrit.wikimedia.org/r/111386 [14:50:49] (03CR) 10Ottomata: [C: 032 V: 032] Removing references to stat1, adding stat1 to decomissioning.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/127250 (owner: 10Ottomata) [14:51:38] there it goes.... [14:51:45] !log powering down stat1 for decom [14:51:51] Logged the message, Master [14:54:05] !log es5,es6 - revoke puppet certs, salt keys, icinga [14:54:11] Logged the message, Master [14:54:18] ottomata: I'm going to step out for about 45 minutes. ping you can call me if anything blows up but I think its all pretty normal stuff. BTW, the current cluster master is 1002. 1001 was the master when you restarted it so I don't even think we'll get another master election during this process [14:54:47] ACKNOWLEDGEMENT - Host es5 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #6266 [14:54:48] ACKNOWLEDGEMENT - Host es6 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #6266 [14:56:09] cool, yeha, no probs [14:56:13] wait, what? [14:56:19] 1001 came back as a master? 
[14:56:26] manybubbles|away: ^ [14:59:43] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:55] jeremyb: https://gerrit.wikimedia.org/r/#/c/126008/1/manifests/certs.pp [15:01:30] ottomata: no, sorry, 1001 came back as non-master [15:01:32] its all good [15:01:38] when you bounced 1001 1002 took over [15:01:56] and now that it has taken over there should be no need for a master election as you bounce the other machines today [15:02:24] its just find [15:07:54] (03CR) 10Hashar: "random thoughts." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127236 (owner: 10Hashar) [15:08:12] (03CR) 10Hashar: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127236 (owner: 10Hashar) [15:08:50] (03CR) 10Hashar: "the magic trick seems to work now :-]" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127236 (owner: 10Hashar) [15:09:39] (03CR) 10Hashar: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127236 (owner: 10Hashar) [15:10:38] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127236 (owner: 10Hashar) [15:10:57] (03PS2) 10ArielGlenn: formey: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/126976 (owner: 10Matanya) [15:13:01] (03PS2) 10BBlack: Only tag 470-07 if going through proxy. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125347 (owner: 10Dr0ptp4kt) [15:13:29] wait, but manybubbles|away [15:13:48] (03CR) 10ArielGlenn: [C: 032] formey: decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/126976 (owner: 10Matanya) [15:13:48] so right now, 1002, 1007 and 1013 are masters, right? 
[15:13:52] (03PS1) 10JanZerebecki: bugzilla apache config: disable caching directives [operations/puppet] - 10https://gerrit.wikimedia.org/r/127254 [15:13:56] (03CR) 10Dzahn: [C: 032] remove lvs1-6 lvs1-6.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/126954 (owner: 10Dzahn) [15:14:04] and we are about to do the master dance with 1007 and 1008 [15:14:11] won't 1008 become the new master when we take down 1007? [15:14:49] (03PS3) 10BBlack: Only tag 470-07 if going through proxy. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125347 (owner: 10Dr0ptp4kt) [15:14:57] (03CR) 10BBlack: [C: 032 V: 032] Only tag 470-07 if going through proxy. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125347 (owner: 10Dr0ptp4kt) [15:15:03] !log DNS update - removing lvs1-6 [15:15:09] Logged the message, Master [15:16:41] (03PS1) 10Andrew Bogott: Remove labstore1 and 2 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/127255 [15:19:05] (03PS2) 10Andrew Bogott: Remove labstore1 and 2 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/127255 [15:20:39] (03PS2) 10BBlack: Set domain to TLD on GeoIP cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [15:21:42] (03PS3) 10Dzahn: remove Tampa appserver reverse DNS and mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/127210 [15:22:29] (03CR) 10Andrew Bogott: [C: 032] Remove labstore1 and 2 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/127255 (owner: 10Andrew Bogott) [15:23:16] (03CR) 10Dzahn: [C: 032] remove Tampa appserver reverse DNS and mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/127210 (owner: 10Dzahn) [15:24:01] !log DNS update - removing all the Tampa mw/srv mgmt [15:24:06] Logged the message, Master [15:26:10] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:10] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:43] (03CR) 10Jgreen: [C: 031] 
Remove db48 and db49 from OTRS mail duties. db49 is decommissioned already so hasn't worked as a secondary for a while. [operations/puppet] - 10https://gerrit.wikimedia.org/r/126203 (owner: 10Springle) [15:28:03] (03PS2) 10ArielGlenn: formey:decom [operations/dns] - 10https://gerrit.wikimedia.org/r/126978 (owner: 10Matanya) [15:28:15] (03PS2) 10Springle: Remove db48 and db49 from OTRS mail duties. db49 is decommissioned already so hasn't worked as a secondary for a while. [operations/puppet] - 10https://gerrit.wikimedia.org/r/126203 [15:28:22] those warnings are my fault, puppet is taking forever on neon (as always) [15:28:29] (03CR) 10Springle: [C: 032] Remove db48 and db49 from OTRS mail duties. db49 is decommissioned already so hasn't worked as a secondary for a while. [operations/puppet] - 10https://gerrit.wikimedia.org/r/126203 (owner: 10Springle) [15:28:39] (03CR) 10ArielGlenn: [C: 032] formey:decom [operations/dns] - 10https://gerrit.wikimedia.org/r/126978 (owner: 10Matanya) [15:28:42] (03PS1) 10JanZerebecki: bugzilla: enable strict transport security [operations/puppet] - 10https://gerrit.wikimedia.org/r/127256 [15:32:25] (03CR) 10JanZerebecki: "Though probably correct, not actually tested." [operations/puppet] - 10https://gerrit.wikimedia.org/r/127256 (owner: 10JanZerebecki) [15:33:10] ottomata: 1002, 1007, and 1008 are master elgiible [15:33:13] but only 1002 is the master [15:33:28] sorry 1007 and 1013 can take over [15:33:34] but they won't unless 1001 goes down [15:33:51] the quorum that Elasticsearch needs is two out of three eligible masters online [15:36:51] paravoid: Jeff_Green was a outage report written for the fundraising banner issue from yesterday? 
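Note on the quorum remark above: the "two out of three eligible masters online" rule is a strict-majority check. A minimal sketch, assuming the three node names mentioned at this point in the discussion (they are illustrative only, and the function name is hypothetical rather than anything from Elasticsearch or puppet):

```python
def has_master_quorum(eligible, online):
    """Return True if a strict majority of master-eligible nodes is online."""
    eligible = set(eligible)
    needed = len(eligible) // 2 + 1          # 2 when there are 3 eligible nodes
    return len(eligible & set(online)) >= needed


# Illustrative node names only; the real list lives in the puppet role.
ELIGIBLE = {"elastic1002", "elastic1007", "elastic1013"}

# One eligible master down for a reinstall: a majority is still online.
one_down = has_master_quorum(ELIGIBLE, ELIGIBLE - {"elastic1007"})

# Two eligible masters down: no majority, so no master can be elected.
two_down = has_master_quorum(ELIGIBLE, {"elastic1002"})
```

This is why the reinstalls are done one master-eligible node at a time: with three eligible nodes, losing one keeps the quorum and losing two does not.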
[15:37:35] (03CR) 10BBlack: [C: 04-1] "A couple things:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [15:38:24] (03PS1) 10Dzahn: remove ms-be/ms-fe Tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127261 [15:38:29] !log switched mchenry to use db1048/db1049 for OTRS address lookups [15:38:34] Logged the message, Master [15:38:56] greg-g: afaik paravoid was working on a report but I haven't seen it yet [15:40:39] Jeff_Green: k, I'd love to chat with K4 about it today, and having something to point at would help, but no major rush (I see the flurry of tampa shutdown activity) [15:41:45] greg-g: sure, I would just track her down in the fundraising channel, the tampa stuff doesn't really affect fundraising anymore [15:42:15] (03PS2) 10Dzahn: remove ms-be/ms-fe Tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127261 [15:43:22] Jeff_Green: what's that channel? [15:43:39] #wikimedia-fundraising [15:43:43] * greg-g prepares to go to window shortcut "G" [15:43:50] logical [15:44:40] (03CR) 10Dzahn: [C: 032] remove ms-be/ms-fe Tampa boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/127261 (owner: 10Dzahn) [15:44:59] (03PS1) 10coren: Remove all traces of labstore[34] from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/127262 [15:45:21] !log DNS update - removing Tampa msbe/msfe [15:45:27] Logged the message, Master [15:47:12] (03PS1) 10Springle: Remove db48 from m2. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127263 [15:48:25] PROBLEM - MySQL Slave Delay on db1048 is CRITICAL: CRIT replication delay 336 seconds [15:49:25] PROBLEM - Varnish HTTP mobile-backend on cp3014 is CRITICAL: Connection refused [15:49:26] (03PS1) 10Dzahn: remove labstore 1-4 [operations/dns] - 10https://gerrit.wikimedia.org/r/127264 [15:49:54] hey manybubbles [15:50:02] yo [15:50:04] role/elasticsearch.pp says that 1008 is master eligible [15:50:05] not 1007 [15:50:08] is that correct? 
[15:50:27] ottomata: as in, right now. let me check [15:50:31] yeah [15:50:53] ottomata: that is correct [15:50:56] I remember! [15:51:01] 1007 was broken for a long time [15:51:10] it kept rebooting so we turned it off until it was fixes [15:51:12] ah right ok, well that's fine then, right? we just do the dance backwards? [15:51:14] and we never made it into a master [15:51:16] sure [15:51:19] ok cool [15:51:21] so do 1007 first [15:51:21] so reinstall 1007 first [15:51:22] k [15:51:24] RECOVERY - MySQL Slave Delay on db1048 is OK: OK replication delay 0 seconds [15:51:45] can't get better then 0 second delay [15:51:49] !log disabling puppet on elasti1007 and elastic1008 for reformatting [15:51:55] Logged the message, Master [15:52:27] (03PS1) 10Ottomata: elastic1007 + master eligible, elastic1008 - master eligible [operations/puppet] - 10https://gerrit.wikimedia.org/r/127265 [15:52:28] !log db48 mysqld set read_only, disabled m2 repl to db1048 [15:52:32] Logged the message, Master [15:52:40] (03CR) 10Ottomata: [C: 032 V: 032] elastic1007 + master eligible, elastic1008 - master eligible [operations/puppet] - 10https://gerrit.wikimedia.org/r/127265 (owner: 10Ottomata) [15:52:57] (03CR) 10Andrew Bogott: [C: 031] remove labstore 1-4 [operations/dns] - 10https://gerrit.wikimedia.org/r/127264 (owner: 10Dzahn) [15:53:10] !log reinstalling elastic1007 [15:53:15] Logged the message, Master [15:53:34] PROBLEM - Varnish HTTP mobile-frontend on cp3014 is CRITICAL: Connection timed out [15:53:48] manybubbles: anomie heh, you two are the only remaining morning SWAT team members since MaxSem moved to SF :) [15:53:49] ^ that's me, and it's not active for prod users [15:53:58] (cp301[34]) [15:54:06] greg-g: i suppose so [15:54:11] and I've been really lazy about it [15:54:14] I wonder who else we can lobby [15:54:33] greg-g: MaxSem moved to SF? 
[15:54:34] we don't have the hour today, right, because friday [15:54:44] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:45] anomie: it was news to me [15:54:48] but he did [15:54:48] anomie: yep! [15:54:56] he's sitting right here next to me [15:55:14] RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 95.46 ms [15:55:34] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:28] (03PS2) 10Springle: Remove db48 from m2. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127263 [15:57:25] RECOVERY - Varnish HTTP mobile-backend on cp3014 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.193 second response time [15:57:51] (03CR) 10coren: [C: 032] Remove all traces of labstore[34] from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/127262 (owner: 10coren) [15:58:05] anomie, he has [15:58:54] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:59:24] (03PS3) 10Springle: Remove db48 from m2. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127263 [15:59:34] (03CR) 10Springle: [C: 032] Remove db48 from m2. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127263 (owner: 10Springle) [16:01:01] (03CR) 10Springle: [V: 032] Remove db48 from m2. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127263 (owner: 10Springle) [16:02:58] (03PS1) 10Springle: Switch db63 for db60 in s1, as latter is on 12th floor. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127267 [16:05:11] (03CR) 10Springle: [C: 032] Switch db63 for db60 in s1, as latter is on 12th floor. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/127267 (owner: 10Springle) [16:10:04] PROBLEM - NTP on cp3014 is CRITICAL: NTP CRITICAL: Offset unknown [16:10:09] !log db63 mysqld shutdown for decom [16:10:15] Logged the message, Master [16:14:04] RECOVERY - NTP on cp3014 is OK: NTP OK: Offset -0.002008199692 secs [16:19:44] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:01] !log ignore cp301[34] msgs, reinstalling them [16:20:07] Logged the message, Master [16:20:32] !log db48 mysqld shutdown for decom [16:20:37] Logged the message, Master [16:21:34] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:21:44] RECOVERY - Host cp3013 is UP: PING WARNING - Packet loss = 93%, RTA = 96.04 ms [16:23:16] (03PS1) 10Springle: Remove db48 from m2, properly this time. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127269 [16:23:44] PROBLEM - Varnish traffic logger on cp3013 is CRITICAL: Connection refused by host [16:23:44] PROBLEM - check configured eth on cp3013 is CRITICAL: Connection refused by host [16:23:54] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:04] PROBLEM - Varnishkafka log producer on cp3013 is CRITICAL: Connection refused by host [16:24:05] PROBLEM - Varnish HTCP daemon on cp3013 is CRITICAL: Connection refused by host [16:24:05] PROBLEM - Varnish HTTP mobile-backend on cp3013 is CRITICAL: Connection refused [16:24:14] PROBLEM - check if dhclient is running on cp3013 is CRITICAL: Connection refused by host [16:24:14] PROBLEM - Varnish HTTP mobile-frontend on cp3013 is CRITICAL: Connection refused [16:24:24] PROBLEM - puppet disabled on cp3013 is CRITICAL: Connection refused by host [16:24:25] PROBLEM - Disk space on cp3013 is CRITICAL: Connection refused by host [16:24:25] PROBLEM - SSH on cp3013 is CRITICAL: Connection refused [16:24:25] PROBLEM - DPKG on cp3013 is CRITICAL: Connection refused by host [16:24:25] PROBLEM - RAID on cp3013 is CRITICAL: Connection refused 
by host [16:25:00] heya ^d: chasemp and I are trying to delete a project in gerrit and having trouble [16:25:01] you'd think dependencies on the host being unreachable would block the noise :( [16:25:17] (03CR) 10Springle: [C: 032] Remove db48 from m2, properly this time. [operations/puppet] - 10https://gerrit.wikimedia.org/r/127269 (owner: 10Springle) [16:25:44] RECOVERY - Host cp3014 is UP: PING WARNING - Packet loss = 44%, RTA = 96.65 ms [16:26:08] man elastic1007 is not heading my commands to PXE boot [16:26:09] grrr [16:26:12] ottomata and paravoid: looks like java u55 is now considered safe [16:26:14] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [16:26:15] http://www.elasticsearch.org/blog/java-1-7u55-safe-use-elasticsearch-lucene/ [16:26:19] gonna have to catch an F12 opportunity! [16:26:23] haha, really manybubbles, weee! [16:26:40] heeding* [16:26:46] ottomata: well, that is in oracle, I imagine openjdk too given that this part of the code is the same (I think) [16:27:05] in general I've found the F12 method to much more reliable than the racadm serveraction bootonce->pxe stuff [16:27:16] yeah, so far with the elastics I haven't had to do it though [16:27:23] done 12 so far [16:27:26] and this is the first one [16:27:27] but ja [16:27:51] qchris: chasemp and I are having trouble deleting a gerrit repo [16:27:54] we're not sure why [16:27:57] i've been able to do it before [16:28:04] if F12 doesn't work either, try Ctrl+S for ethernet card setup. A few hosts I did recently had PXE disabled inside there. [16:28:04] PROBLEM - Varnish HTCP daemon on cp3014 is CRITICAL: Connection refused by host [16:28:05] PROBLEM - Varnishkafka log producer on cp3014 is CRITICAL: Connection refused by host [16:28:05] PROBLEM - DPKG on cp3014 is CRITICAL: Connection refused by host [16:28:09] ottomata: Which one? 
[16:28:10] but I can't do it on this one [16:28:14] PROBLEM - check if dhclient is running on cp3014 is CRITICAL: Connection refused by host [16:28:14] PROBLEM - SSH on cp3014 is CRITICAL: Connection refused [16:28:19] hm, ok thanks bblack [16:28:25] PROBLEM - Varnish HTTP mobile-frontend on cp3014 is CRITICAL: Connection refused [16:28:25] PROBLEM - Varnish HTTP mobile-backend on cp3014 is CRITICAL: Connection refused [16:28:25] PROBLEM - Disk space on cp3014 is CRITICAL: Connection refused by host [16:28:25] PROBLEM - check configured eth on cp3014 is CRITICAL: Connection refused by host [16:28:25] PROBLEM - Varnish traffic logger on cp3014 is CRITICAL: Connection refused by host [16:28:25] PROBLEM - puppet disabled on cp3014 is CRITICAL: Connection refused by host [16:28:29] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/ircd-ratbox [16:28:30] qchris: ^ [16:28:34] PROBLEM - RAID on cp3014 is CRITICAL: Connection refused by host [16:28:44] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:01] ottomata: \(D\|G\)one. [16:30:18] ottomata: Will you take care of cleaning up gitblit and github? [16:31:11] qchris: how did you do it :) [16:31:12] we are going to recreate the repo qchris [16:31:16] yeah how'd you do it?! [16:31:27] chasemp, ottomata: ssh gerrit.wikimedia.org deleteproject delete --yes-really-delete --force operations/debs/ircd-ratbox [16:31:33] i have never cleaned up gitblit or github before! [16:31:34] hm [16:31:41] weird that's basically what we did [16:31:57] GAH I missed F12 on elastic1007 [16:31:58] gaahhh [16:31:58] haha [16:32:03] it was so fast! [16:32:14] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:32:31] but ja qchris we were getting: Gerrit Code Review: delete-project: not found [16:32:41] Ha :-) [16:32:54] There are two delete-project plugins in gerrit. [16:33:07] delete-project was the old plugin name. [16:33:09] oh [16:33:10] hm [16:33:12] so no hyphen? 
[16:33:16] Now it is deleteproject (no dash) [16:33:18] Yes. [16:33:20] ah no - [16:33:28] my brain didn't even notice [16:33:32] ok, thanks [16:33:47] can update the wiki as well then [16:33:49] ok chasemp, see if you can recreate with no commits [16:33:55] and then push your stuff to it [16:34:04] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:06] i think it would be fine to push your upstream and master branches [16:34:11] directly [16:34:16] but maybe push your debian branch for review? [16:35:11] chasemp: Done. [16:35:46] qchris: I meant I would do it don't want you to think otherwise, but awesome [16:36:10] chasemp: Oh ... Well ... :-D [16:36:14] PROBLEM - NTP on cp3013 is CRITICAL: NTP CRITICAL: No response from NTP server [16:36:22] ottomata: about gitblit + github cleanup ... I have no clue how to do it, as I lack permission to do it. [16:36:24] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:36:40] bah this machine didnt' even give me a change to hit F12 [16:36:44] i was watching the whole time! [16:37:18] chance* [16:38:27] <^d> qchris, ottomata: Hm? [16:38:28] <^d> Sup? [16:38:38] i think we are good, thanks ^d [16:38:45] <^d> Oh ok :p [16:39:25] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:40:14] PROBLEM - NTP on cp3014 is CRITICAL: NTP CRITICAL: No response from NTP server [16:40:24] (03CR) 10Ori.livneh: "> Or are there other, more common cases I'm not thinking of that this would catch?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [16:40:34] bblack: ^ [16:42:10] ottomata: elastic1001 still doesn't want to ganglia! 
[16:42:37] oh hm [16:42:39] ok will check on that [16:42:41] geez [16:42:41] nope [16:42:45] I can't get 1007 to PXE boot [16:42:48] i just sat there for 5 minutes [16:42:54] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:42:54] hitting F12 every 2 seconds [16:43:00] and it still just boots regular [16:43:01] hmmm [16:43:18] cmjohnson1: i don't suppose you have time to help with that, do you? [16:44:29] ottomata: not right now...if it requires on-site I won't be back to eqiad until April 30th [16:46:54] PROBLEM - Host manutius is DOWN: PING CRITICAL - Packet loss = 100% [16:47:16] ottomata: are you sending the pxe boot one command? [16:47:21] yes [16:47:36] i'd try resetting the drac entirely as well and see if it fixes [16:47:42] hm ok [16:47:44] racadm racreset [16:47:54] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms [16:47:59] its about the last ditch to 'unplug it and plug it back in' which as chris states would wait awhile =[ [16:48:30] * RobH wants switched pdus for this reason [16:48:38] i think I missed chris' comment, my internet died for a sec [16:48:44] resetting drac [16:48:52] are we still using manutius? [16:50:38] (03CR) 10BBlack: "Domain is probably more-reasonable, even if it's ambiguous. TLD always explicitly means only the final component, e.g. ".org", or ".de". 
" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [16:51:11] (03CR) 10BBlack: "(I meant left-hand-side above!)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [16:53:34] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:34] PROBLEM - Host manutius is DOWN: PING CRITICAL - Packet loss = 100% [16:56:54] nope RobH, no good [16:56:58] the thing just sits at blank console screen [16:57:03] for some minutes [16:57:04] and then boots [16:57:14] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:57:18] and no activity on dhdpd logs for it hitting eh? [16:57:27] (cuz if it hits but has no good reply, it would do that) [16:57:28] ottomata: you never get bios-y outputs? just linux boot output? [16:57:52] nope, just linux boot output [16:58:11] Connected to Serial Device 2. To end type: ^\ [16:58:11] [ 0.000000] Initializing cgroup subsys cpuset [16:58:11] [ 0.000000] Initializing cgroup subsys cpu [16:58:11] [ 0.000000] Linux version 3.2.0-56-generic (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #86-Ubuntu SMP Wed Oct 23 09:20:45 UTC 2013 (Ubuntu 3.2.0-56.86-generic 3.2.51) [16:58:11] [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-56-generic root=UUID=dfe8064d-aa77-4a14-aa54-914bc4f157a5 ro console=ttyS1,115200n8 elevator=deadline rootdelay=90 [16:58:19] maybe something's dorked up with the bios serial redirection stuff [16:59:26] (and I don't know if that's fixable in the mgmt ssh interface, might require VGA?) [16:59:47] i'm rebooting again, tailing syslog on carbon [16:59:52] ottomata: if you would like i can hop on and take a look at it. [16:59:53] grepping for this node [16:59:59] i dont see it hitting carbon [16:59:59] oh sure, would appreciate that RobH [17:00:05] ok [17:00:11] console is yours [17:00:11] ok, im going to drop it into bios [17:00:15] k [17:00:30] just to confirm, this is elastic1007 right? 
[17:01:03] yup [17:01:45] host elastic1007 { [17:01:45] hardware ethernet a4:ba:db:19:cd:29; [17:01:45] fixed-address elastic1007.eqiad.wmnet; [17:01:47] } [17:01:54] Embedded NIC MAC Addresses: [17:01:54] NIC1 Ethernet = 84:2b:2b:57:f9:08 [17:01:58] thats the problem. [17:02:07] dhcpd lease mac entry is not correct. [17:02:17] (thats whats live in puppet now) [17:02:36] linux-host-entries.ttyS1-115200 needs update [17:03:15] the give away is the syslog on carbon of DHCPDISCOVER from 84:2b:2b:57:f9:08 via 10.64.32.2: network 10.64.32.0/22: no free leases [17:04:00] usually check there for either that or the more confusing and easily overlooked missing vlan change (if you forget to set vlan accordingly it shows the request coming in from the wrong network [17:04:02] ) [17:04:55] (i never got quite to serial console as i was grepping through carbon when i asked if you wanted me to hop on it) off serial back to you ottomata [17:05:56] ahh intersting [17:06:40] the added issue is we have a lot of cruft filling dhcpd that doesnt need to [17:06:58] once in awhile i'll go hunt down offending items and stop them from doing so but haven't in a very long time. [17:07:26] (03PS1) 10Ottomata: Fixing MAC addy for elastic1007 [operations/puppet] - 10https://gerrit.wikimedia.org/r/127275 [17:07:28] (hey thats also an excellent task for figuring out our networks for a newer opsen i bet maybe) [17:07:47] though perhaps after we finish tampa shutdown next week [17:07:48] (03CR) 10Ottomata: [C: 032 V: 032] Fixing MAC addy for elastic1007 [operations/puppet] - 10https://gerrit.wikimedia.org/r/127275 (owner: 10Ottomata) [17:07:53] that'll clear a lot of cruft on its own. 
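Note on the diagnosis above: the giveaway was a DHCPDISCOVER in syslog from a MAC that has no matching host entry, answered with "no free leases". A minimal sketch of that cross-check, assuming the dhcpd config text and the syslog lines are already loaded as strings (the function names are hypothetical, and the regexes are written only against the two samples quoted in the log):

```python
import re

# Matches dhcpd-style entries: host NAME { ... hardware ethernet MAC; ... }
HOST_RE = re.compile(
    r"host\s+(\S+)\s*\{[^}]*?hardware\s+ethernet\s+([0-9a-fA-F:]{17})\s*;",
    re.S,
)
# Matches the syslog line quoted above for an unanswered discover.
DISCOVER_RE = re.compile(
    r"DHCPDISCOVER from ([0-9a-fA-F:]{17}) via \S+: network \S+: no free leases"
)


def configured_macs(dhcp_conf):
    """Map each configured MAC (lower-cased) to its host name."""
    return {mac.lower(): host for host, mac in HOST_RE.findall(dhcp_conf)}


def unknown_discovers(dhcp_conf, syslog_lines):
    """List MACs that sent DHCPDISCOVER, got no lease, and have no host entry."""
    known = configured_macs(dhcp_conf)
    unknown = []
    for line in syslog_lines:
        match = DISCOVER_RE.search(line)
        if not match:
            continue
        mac = match.group(1).lower()
        if mac not in known and mac not in unknown:
            unknown.append(mac)
    return unknown


CONF = """
host elastic1007 {
    hardware ethernet a4:ba:db:19:cd:29;
    fixed-address elastic1007.eqiad.wmnet;
}
"""
LOG = [
    "DHCPDISCOVER from 84:2b:2b:57:f9:08 via 10.64.32.2: "
    "network 10.64.32.0/22: no free leases",
]
stale = unknown_discovers(CONF, LOG)
```

With the samples from the log, `stale` holds the NIC1 address reported by the BIOS, which is the one missing from the host entry.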
[17:11:53] RobH and ottomata: maybe the mac is bad in puppet because it was broken for a long while and we replaced portions of it [17:12:03] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:06] only if the mainboard swapped [17:12:15] but it happens, i wasnt going to bother to see why [17:12:22] (still not ;) [17:12:35] k [17:12:36] haha [17:12:43] I'm pretty sure we swapped the main board on it months ago [17:12:55] there was once a time when i knew all the hw floating in and out of the datacenters in my head [17:12:57] that time is long past [17:13:34] i have the big ticket items, but the daily warranty swaps just happen (well, not really chris is doing all of them ;) [17:13:48] in both tampa and ashburn... [17:13:56] everyone owes cmjohnson1 a beer. [17:14:36] but yea, mainboard swap would explain the mac address change [17:15:02] there was a board swap [17:15:54] (03PS1) 10Gerrit Patch Uploader: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127277 [17:15:56] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127277 (owner: 10Gerrit Patch Uploader) [17:17:13] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:18:23] RECOVERY - SSH on cp3013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [17:19:13] PROBLEM - SSH on elastic1007 is CRITICAL: Connection refused [17:19:13] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [17:19:53] PROBLEM - Disk space on elastic1007 is CRITICAL: Connection refused by host [17:19:53] PROBLEM - RAID on elastic1007 is CRITICAL: Connection refused by host [17:19:53] PROBLEM - check configured eth on elastic1007 is CRITICAL: Connection refused by host [17:19:54] (03PS1) 10Gerrit Patch Uploader: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 [17:19:56] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [17:20:03] PROBLEM - puppet disabled on elastic1007 is CRITICAL: Connection refused by host [17:20:03] PROBLEM - check if dhclient is running on elastic1007 is CRITICAL: Connection refused by host [17:20:03] PROBLEM - DPKG on elastic1007 is CRITICAL: Connection refused by host [17:20:32] ottomata: if its got all those alerts, it means its keys and such are still in puppetstoredconfig, puppetca, salt... 
[17:20:45] you may wanna clear all that out before it tries to call in post install, or you'll have to do a bit more cleanup [17:20:55] (or maybe you have and neon hasnt run puppet update is all) [17:21:16] yeah that's fine [17:21:23] that's nto for decom [17:21:26] it'll come back up fine [17:21:30] after install [17:21:35] we are just reinstalling to reformat [17:23:01] right [17:23:04] but those keys are borked now [17:23:15] it doesnt matter if its a decom or a reinstall to puppetstoredconfig/puppetca/salt-key [17:23:28] (03CR) 10Vogone: [C: 04-1] "Please abandon this change. The correct one is located at: https://gerrit.wikimedia.org/r/#/c/127278/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127277 (owner: 10Gerrit Patch Uploader) [17:23:36] now when its done installing, it has a new host key, puppet key, salt key [17:23:44] and palladium has the old version of all that [17:23:49] so you have to still delete off those old keys [17:24:02] (and if you dont do it before puppet calls in on the new server, you get to delete stuff off the host as well ;) [17:24:11] on the new install that is, not new server. [17:24:38] (03Abandoned) 10Hoo man: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127277 (owner: 10Gerrit Patch Uploader) [17:27:13] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:16] yes, i delete the stuff on palladium [17:27:18] puppet and salt keys [17:27:27] (03PS2) 10Vogone: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [17:27:37] i have to to do puppetstoredconfig on reinstall RobH? [17:27:38] why? [17:27:43] does it ID by the key? 
[17:27:43] RECOVERY - Varnish traffic logger on cp3013 is OK: PROCS OK: 2 processes with command name varnishncsa [17:27:43] RECOVERY - check configured eth on cp3013 is OK: NRPE: Unable to read output [17:27:51] i think it does yes, but im not 100% to be honest. [17:28:03] RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 87.31 ms [17:28:04] RECOVERY - Varnish HTCP daemon on cp3013 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [17:28:08] but, feel free to find out and let me know =] [17:28:13] RECOVERY - check if dhclient is running on cp3013 is OK: PROCS OK: 0 processes with command name dhclient [17:28:14] RECOVERY - SSH on cp3014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [17:28:16] hmmm [17:28:18] i dunno, i ena [17:28:20] mean [17:28:23] i've done 12 of these nodes already [17:28:23] RECOVERY - puppet disabled on cp3013 is OK: OK [17:28:24] RECOVERY - Disk space on cp3013 is OK: DISK OK [17:28:24] RECOVERY - DPKG on cp3013 is OK: All packages OK [17:28:25] and ahven't done that [17:28:27] and this is fine [17:28:27] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=elastic1006 [17:28:33] RECOVERY - RAID on cp3013 is OK: OK: optimal, 2 logical, 2 physical [17:28:36] oh, so you clear puppetca but not puppetstoredconfig? [17:28:40] ja [17:28:47] then you are right and i am mistaken [17:28:49] =] [17:29:02] :) [17:29:05] i guess as long as its same fqdn and such it doesnt seem to care [17:29:09] yeha [17:29:10] good to know. [17:29:15] 1007 is reinstalling just fine now, btw [17:29:16] danke [17:29:35] i feel like i should somehow reflect this on lifecycle doc [17:30:43] new section after in service before decom/reclaim i suppose [17:30:58] ottomata: So just to check, when doing the reinstall, you had to puppetca clear and salt-key clear right? 
[17:31:11] i'm using your steps so i dont have to do them all right now personally =] [17:31:19] yes [17:31:22] have to do that [17:31:38] but no puppetstoredconfig clear cuz no IP/FQDN change [17:31:43] cool, adding [17:32:04] RECOVERY - Varnishkafka log producer on cp3013 is OK: PROCS OK: 1 process with command name varnishkafka [17:32:13] PROBLEM - NTP on elastic1007 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:30] i'm going to put that its fqdn and ip based (with an note the ip part needs checking) [17:32:53] (since a fqdn can remain unchanged but IP changes can occur in reinstall) [17:33:04] RECOVERY - Varnish HTTP mobile-backend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.197 second response time [17:33:13] RECOVERY - SSH on elastic1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [17:33:13] RECOVERY - Varnish HTTP mobile-frontend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.188 second response time [17:34:51] RobH, here's one I'm not sure about [17:34:59] at what point can you sign a new salt key? [17:35:05] i've been waiting until after the first puppet run [17:35:07] after the first puppet run [17:35:22] if you try before, it will just tell you there isn't an unaccepted one yet [17:35:25] but that means that you have to run puppet twice often [17:35:38] yeah, i just had a case where it had a key to sign...not sure why [17:35:42] ohoh [17:35:44] because I deleted it [17:35:51] and the thing kept booting back into the original OS [17:35:56] I usually have to run puppet 3-4 times anyways, because our dependencies are never perfect [17:35:57] so it probably ran salt [17:36:00] and tried to make a new key [17:36:02] ok yeah [17:36:09] elastics are really good! [17:36:15] everything (except for salt) just works [17:36:24] thanks to manybubbles :) [17:36:53] what'd I do? 
[17:36:54] on the varnish boxes it's typically ganglia and varnishkafka that don't work right until the 3rd or 4th puppet run [17:37:06] varnishkafka!? [17:37:08] hm [17:37:15] oh something with the varnishkafkaganglia module [17:37:26] bblack, next time you have to do one [17:37:32] ping me, i'd love to fix that [17:37:42] ottomata: what bblack said, yep [17:37:44] manybubbles: just wrote some good puppet stuff [17:37:44] also https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reinstallation [17:37:56] ottomata: review that section and see if it matches with reality if you dont mind =] [17:38:08] its a bit more verbose to cover more use cases [17:38:13] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:38:14] RECOVERY - check if dhclient is running on cp3014 is OK: PROCS OK: 0 processes with command name dhclient [17:38:24] RECOVERY - Disk space on cp3014 is OK: DISK OK [17:38:24] RECOVERY - puppet disabled on cp3014 is OK: OK [17:38:24] RECOVERY - check configured eth on cp3014 is OK: NRPE: Unable to read output [17:38:24] RECOVERY - Varnish traffic logger on cp3014 is OK: PROCS OK: 2 processes with command name varnishncsa [17:38:33] RECOVERY - RAID on cp3014 is OK: OK: optimal, 2 logical, 2 physical [17:39:03] RECOVERY - Varnish HTCP daemon on cp3014 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [17:39:04] RobH, yeah, some of those won't really apply for many reinstalls [17:39:12] like, removing from dsh groups [17:39:13] RECOVERY - DPKG on cp3014 is OK: All packages OK [17:39:52] ottomata: if you want to look for the missing dep on varnishkafka+ganglia stuff: [17:39:55] err: /Stage[main]/Varnishkafka::Monitoring/Exec[generate-varnishkafka.pyconf]/returns: change from notrun to 
0 failed: /usr/bin/python /usr/lib/ganglia/python_modules/varnishkafka.py --generate --tmax=15 /var/cache/varnishkafka/varnishkafka.stats.json > /etc/ganglia/conf.d/varnishkafka.pyconf.new returned 1 instead of one of [0] at /etc/puppet/modules/varnishkafka/manifests/monitoring.pp:25 [17:40:03] RECOVERY - Varnishkafka log producer on cp3014 is OK: PROCS OK: 1 process with command name varnishkafka [17:40:14] probably a lack of a dep on creating some directory, or something trivial like that [17:40:27] hm, ok it can't run the python code to generate the ganglia module conf [17:40:34] hm, do you have a node where it just did that? [17:40:40] i'd like to run the command manually and see what the output is [17:40:44] cp3014.esams.wm.org [17:40:46] maybe it needs to require => ganglia or somethring [17:40:47] ok cool [17:41:07] but it may work the first time you try manually, since something later in the puppet run probably fixed it [17:41:35] yeah, ah, it runs now [17:41:37] hm [17:41:46] ottomata: as long as i dont have anything that breaks process its cool. i realize dsh groups, pybal mentions, db.php mentions are not applicatible to most situations [17:41:47] (03CR) 10John F. Lewis: [C: 031] "If it is needed, I guess. Patch is fine anyway." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [17:42:32] I would if we could sync up dsh groups to salt grains and auto-generate the dsh groups from salt [17:42:33] ohhh, that's because it needs the .stats.json file to be present, and that's not going to be present until varnishkafka runs for the first time and outputs one [17:42:47] bblack: that would be awesome, i want salt to replace the apache scripts so badly. [17:42:48] or just use salt! do we need dsh at all anymore? [17:42:50] e.g. 
dsh group cache_mobile == salt -G cluster:cache_mobile [17:42:53] RECOVERY - Disk space on elastic1007 is OK: DISK OK [17:42:53] RECOVERY - RAID on elastic1007 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:42:53] RECOVERY - check configured eth on elastic1007 is OK: NRPE: Unable to read output [17:42:56] we have to use dsh for all the odd scripts [17:43:00] until they are migrated [17:43:03] RECOVERY - check if dhclient is running on elastic1007 is OK: PROCS OK: 0 processes with command name dhclient [17:43:03] RECOVERY - puppet disabled on elastic1007 is OK: OK [17:43:03] RECOVERY - DPKG on elastic1007 is OK: All packages OK [17:43:10] (03CR) 10Ori.livneh: "The procedure for determining the topmost domain that a site can set cookies for is a bit complicated. You want www.cnn.com to be able to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/127131 (owner: 10Ori.livneh) [17:43:11] (those same scripts are the ones that require key forwarding) [17:43:16] so they are already slightly evil ;] [17:43:31] now that i have a fancy no forwarding my key ssh config i hate them even more. [17:43:33] and while salt is way more advanced and capable, dsh (and similar tools) are usually better at basic dsh functionality than salt cmd.run [17:44:04] hm, ok aye [17:44:10] bblack, hm, actually i'm not sure how to fix that one... 
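(Editor's note: the idea floated above, generating dsh groups from salt grains, could look roughly like this. It is a sketch under assumptions: the host-to-cluster mapping is hard-coded sample data rather than a real salt query, the host names and cluster values are illustrative, and the one-host-per-line output reflects the usual dsh group file layout.)

```python
from collections import defaultdict
from typing import Dict, List


def dsh_groups_from_grains(cluster_by_host: Dict[str, str]) -> Dict[str, List[str]]:
    """Group hosts by their 'cluster' grain value, one dsh group per cluster."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, cluster in cluster_by_host.items():
        groups[cluster].append(host)
    # Sort so regenerated files are stable and diff cleanly.
    return {name: sorted(hosts) for name, hosts in groups.items()}


def render_group_file(hosts: List[str]) -> str:
    """Render a dsh group file body: one hostname per line."""
    return "".join(host + "\n" for host in hosts)


if __name__ == "__main__":
    # Hypothetical sample data; in practice this would come from salt grains.
    sample = {
        "cp3014.example.org": "cache_mobile",
        "cp3013.example.org": "cache_mobile",
        "elastic1007.example.org": "elasticsearch",
    }
    for name, hosts in dsh_groups_from_grains(sample).items():
        print("# group:", name)
        print(render_group_file(hosts), end="")
```

Keeping salt as the single source of truth would mean the dsh files are only ever written by a generator like this, never edited by hand.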
[17:44:11] but yea, it would be nice to ditch it [17:44:19] which is what the entire new deployment saga is about [17:44:32] i dont even wanna mention its name, cuz i know folks have pings based off it ;] [17:45:00] The Saga That Cannot Be Named (TSTCBN) [17:45:00] (03CR) 10MaxSem: [C: 04-1] TextExtracts: Add classes and elements to the exclusion list (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126226 (owner: 10Prtksxna) [17:45:15] so, the ganglia module conf file is generated from JSON keys in a stats file that varnishkafka generates [17:45:43] so, a file that a service writes to has to be present and have content before that exec runs successfully [17:45:55] that file (and especially the content) is not managed by puppet [17:46:09] probably the exec shouldn't be managed by puppet either, then :) [17:46:24] i could probably add test -f stats.file.json || command [17:46:31] that way the exec will just return true if the file doesn't exist [17:46:38] well or use onlyif on the exec [17:46:45] hm yeah [17:46:46] hm [17:46:53] yeahhhhHHhh [17:46:56] that is better [17:47:52] manybubbles: finally 1007 is back and good [17:47:59] yay [17:48:05] !log resinsalling elastic1008 [17:48:07] go ahead and do 1008 then [17:48:11] Logged the message, Master [17:49:20] ottomata: probably should unexclude 1007 from the list [17:49:23] RECOVERY - Varnish HTTP mobile-frontend on cp3014 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.176 second response time [17:49:24] RECOVERY - Varnish HTTP mobile-backend on cp3014 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.191 second response time [17:49:28] yeah [17:49:29] hm [17:49:30] um [17:49:34] i can't log into 1008 mgmt [17:49:40] umm...RobH? can you? 
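(Editor's note: the guard discussed above, only generating the ganglia conf once varnishkafka has written its stats file, behaves like the sketch below. It mimics what an onlyif check on the exec would do; it is not the actual change to the puppet module. The function name is hypothetical. One caveat on the shell variant as typed: `test -f FILE || command` runs the command only when the file is missing, so that form would need the test inverted to get the intended behaviour.)

```python
import os
import subprocess
import sys
from typing import List, Optional

# Path taken from the puppet error quoted in the channel.
STATS_FILE = "/var/cache/varnishkafka/varnishkafka.stats.json"


def run_if_stats_exist(stats_path: str, command: List[str]) -> Optional[int]:
    """Run `command` only if `stats_path` exists, like an onlyif guard.

    Returns None when the command is skipped, otherwise its exit status.
    """
    if not os.path.isfile(stats_path):
        # Nothing to generate from yet; skipping is not a failure.
        return None
    return subprocess.run(command).returncode


if __name__ == "__main__":
    # Harmless stand-in command; the real exec runs the ganglia generator.
    status = run_if_stats_exist(STATS_FILE, [sys.executable, "-c", "pass"])
    print("skipped" if status is None else "exit status %d" % status)
```

With this shape the first puppet run on a fresh node skips the step quietly, and a later run picks it up once the service has produced the file.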
[17:49:41] elastic1008 [17:50:28] ottomata: at least shut down Elasticsearch on it though [17:50:48] so it loses master eligibility [17:50:51] !log cp301[34] reinstalls complete, should stay ok in monitoring [17:50:57] Logged the message, Master [17:53:43] ok [17:56:03] RECOVERY - NTP on elastic1007 is OK: NTP OK: Offset -0.005284428596 secs [17:56:53] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [18:09:17] (PS3) Ori.livneh: Set domain to TLD on GeoIP cookie [operations/puppet] - https://gerrit.wikimedia.org/r/127131 [18:10:02] (CR) Ori.livneh: "@bblack: I changed 'tld' to the more accurate 'top_cookie_domain' and expanded the comments to include a discussion of the public suffix i" [operations/puppet] - https://gerrit.wikimedia.org/r/127131 (owner: Ori.livneh) [18:11:24] ottomata: sorry, had a phone call [18:11:34] so i'm not getting a ping on elastic1008.mgmt [18:11:48] seems it either isn't set up, or has a bad cable, that kind of thing [18:12:20] ottomata: if you don't mind, since you discovered it you should put in a ticket into eqiad stating that elastic1008.mgmt is unresponsive to ssh or ping [18:12:49] though if you need it taken care of before the 29th let me know. [18:13:11] we'll have to pay Equinix smart hands to investigate it, which means giving them a very, very specific (and tailored to them) set of instructions [18:13:30] (PS1) Spage: Enable Flow on mw:Talk:Beta_Features/Nearby_Pages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127280 [18:13:56] this one wouldn't be too hard to instruct on as long as we don't care about uptime of elastic1008 while they work the issue [18:14:28] (CR) Spage: [C: -2] "Deploy in Flow window Tuesday April 22 2pm PDT" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127280 (owner: Spage) [18:18:28] (Sorry, am in meeting) [18:18:48] wait, what RobH? 
[18:18:53] i'm logged into elastic1008 right now [18:18:56] oh you mean mgmt IP? [18:19:00] oh [18:19:26] I think we should log outage in SAL [18:19:31] outage-s [18:19:39] Can't find any mention of the one yesterday [18:23:02] Hm, log says 14:51 to 15:28 UTC [18:44:43] PROBLEM - Puppet freshness on elastic1008 is CRITICAL: Last successful Puppet run was Fri 18 Apr 2014 03:44:05 PM UTC [18:49:53] PROBLEM - Host ps1-c3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:56:55] hm, ok manybubbles, so what should I do with 1008 then? [18:57:00] sounds like it won't happen for a while [18:57:06] should I run puppet on it so it picks up the change [18:57:10] and just restart elasticsearch [18:57:12] and then move shards to it? [18:57:30] ottomata: yeah, run puppet so it picks up the change, bounce elasticsearch, then remove the blocks [18:58:13] PROBLEM - Host ps1-d3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:58:17] ottomata: may as well remove the blocks from 1007 too [18:58:53] PROBLEM - Host ps1-d2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:58:53] PROBLEM - Host ps1-d1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:59:06] k [19:00:03] RECOVERY - Puppet freshness on elastic1008 is OK: puppet ran at Fri Apr 18 18:59:56 UTC 2014 [19:00:39] ok, moving shards back to both 1007 and 1008 [19:00:53] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:01:30] greg-g, what are your thoughts on me deploying code to tantalum (my pdf test server) today? it's deployed via trebuchet so it should be contained to just my stuff [19:01:59] mwalker: test server? any impact with prod pdf redenering? 
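(Editor's note: "removing the blocks" so shards move back is assumed here to mean clearing a shard allocation exclusion in the Elasticsearch cluster settings; the log does not say which mechanism was used. The sketch only builds the JSON body for a cluster settings update using the `cluster.routing.allocation.exclude._ip` setting. The helper name is hypothetical, 10.64.32.140 is the elastic1008 address from the alert above, and the second address is a made-up placeholder.)

```python
import json
from typing import Iterable


def allocation_exclude_body(excluded_ips: Iterable[str]) -> str:
    """Build a transient cluster-settings body excluding the given node IPs.

    Passing no IPs yields an empty value, which is assumed to lift the
    exclusion so shards can be allocated to every node again.
    """
    value = ",".join(sorted(excluded_ips))
    settings = {"transient": {"cluster.routing.allocation.exclude._ip": value}}
    return json.dumps(settings, sort_keys=True)


if __name__ == "__main__":
    # While reinstalling: keep shards off the node being worked on.
    print(allocation_exclude_body(["10.64.32.140"]))
    # Afterwards: clear the exclusion so shards move back.
    print(allocation_exclude_body([]))
```

The body would be sent to the cluster settings endpoint by whatever HTTP client is at hand; that part is left out on purpose.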
[19:02:23] !log enabled cp30[14] varnish mobile frontends in esams pybal [19:02:29] Logged the message, Master [19:02:33] greg-g, nope; not production at all [19:02:39] ok manybubbles, do you think it's ok to move on to 1013,1014? [19:03:01] mwalker: have fun then! [19:03:05] kk! [19:03:10] ottomata: I'd give it some time to recover back to the other machines [19:03:16] * mwalker prepares to bring down the cluster inadvertently [19:03:27] ok cool [19:03:28] yeah [19:03:28] honestly might as well wait until monday, really [19:03:32] ok let's do that [19:04:49] (PS1) BBlack: enable cp3013 mobile backend in esams [operations/puppet] - https://gerrit.wikimedia.org/r/127286 [19:05:42] (CR) BBlack: [C: 2 V: 2] enable cp3013 mobile backend in esams [operations/puppet] - https://gerrit.wikimedia.org/r/127286 (owner: BBlack) [19:07:21] (PS1) BBlack: fix missing comma in 922bea82 [operations/puppet] - https://gerrit.wikimedia.org/r/127287 [19:07:26] that's what I get for not waiting on slow-ass jenkins [19:07:44] (CR) BBlack: [C: 2 V: 2] fix missing comma in 922bea82 [operations/puppet] - https://gerrit.wikimedia.org/r/127287 (owner: BBlack) [19:12:25] <^d> ottomata, manybubbles: Am I right in reading? Just 13/14 to go? [19:12:33] that's right! well [19:12:34] and 1008 [19:12:40] 1008 has a problem with the mgmt interface [19:12:42] can't reboot it1 [19:12:43] ! [19:12:46] can't log into reboot it [19:12:47] <^d> Ah ok, missed that [19:12:54] updating ticket... [19:13:57] ^d, just fyi, i wasn't able to automate every step of the formatting [19:13:58] :/ [19:14:09] the raid and partitions are created fine [19:14:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:14:14] but there are some fs tweaks that should be made [19:14:22] whenever a new node is created [19:14:24] https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes [19:14:25] <^d> Ah ok. 
[19:14:26] i added the steps there [19:14:34] <^d> But hey, partial automation's better than totally manual :) [19:14:41] fortunately, if they are forgotten, it's not a big deal [19:14:48] they can be done on the partition any time [19:14:54] partition just has to be unmounted [19:14:56] not reformatted [19:22:56] (PS1) BBlack: remove cp3013 from esams backends [operations/puppet] - https://gerrit.wikimedia.org/r/127293 [19:23:10] (CR) BBlack: [C: 2 V: 2] remove cp3013 from esams backends [operations/puppet] - https://gerrit.wikimedia.org/r/127293 (owner: BBlack) [19:24:09] not sure what's wrong there, but adding 3013 caused the 5xx spike [19:27:04] ottomata: So elastic1008 is in service though right? [19:27:13] yes it is [19:27:17] but we can take it out pretty easily [19:27:18] you may want to append that to the ticket, as the fix is rebooting it into BIOS [19:27:26] ok [19:27:40] preventing future outages, wooo =] [19:27:55] usually chris checks on each server anyhow, but it's always nice to save him as many steps as possible [19:28:07] yeah [19:28:14] RobH, was that the proper queue/ [19:28:14] ? [19:28:21] eqiad? or would core-ops have been better? [19:28:43] eqiad [19:28:55] anything that requires physical hands on goes in those queues [19:29:03] if it's software work, then not in those [19:29:20] (i realize i'm overexplaining each question, but we have new opsen! 
;) [19:30:24] I appreciate the verbosity :) [19:30:26] since this requires some dude to attach a crash cart, specific queue for said datacenter [19:30:47] yea, i like that all the new opsen are irc lurkers and have backlogs for issues to review [19:30:59] huzzah irc bouncers =] [19:47:47] (PS1) Ottomata: Only running generate-varnishkafka.pyconf command if varnishkafka.stats.json exists [operations/puppet/varnishkafka] - https://gerrit.wikimedia.org/r/127343 [19:48:51] (CR) Ottomata: [C: 2 V: 2] Only running generate-varnishkafka.pyconf command if varnishkafka.stats.json exists [operations/puppet/varnishkafka] - https://gerrit.wikimedia.org/r/127343 (owner: Ottomata) [19:49:37] (PS1) Ottomata: Updating varnishkafka module with monitoring dependency fix [operations/puppet] - https://gerrit.wikimedia.org/r/127346 [19:49:51] (PS2) Ottomata: Updating varnishkafka module with monitoring dependency fix [operations/puppet] - https://gerrit.wikimedia.org/r/127346 [19:49:56] (CR) Ottomata: [C: 2 V: 2] Updating varnishkafka module with monitoring dependency fix [operations/puppet] - https://gerrit.wikimedia.org/r/127346 (owner: Ottomata) [20:22:33] (PS1) Chad: Opt remaining wikis into Cirrus beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127363 [20:25:18] \o/ ^d [20:25:31] <^d> :) [20:25:32] ^d: When are you planning to merge this? [20:25:46] I was just about to add the other change to Tech News [20:25:53] <^d> Soon as we're done disk-juggling. Hopefully no later than the first part of next week. [20:26:12] Okay. Will keep an eye on it then :) [20:29:51] ^d: can we set a date for that, maybe? :) pretty please :) [20:30:24] <^d> Boo, that's no fun! 
[20:30:40] :P [20:37:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:47:30] (03CR) 10Manybubbles: [C: 031] Opt remaining wikis into Cirrus beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127363 (owner: 10Chad) [20:49:22] manybubbles: sorry about that [20:49:33] ottomata: ? [20:49:36] plugins [20:49:49] was really flustered when 1007 came back online [20:50:16] 1008 was not working, toby all the sudden joined the meeting I had been waiting for him to join, and my lunch had just arrived [20:50:22] i just deployed them now [20:50:35] can I bounce elasticsearch real quick? [20:50:49] oh 1007? [20:55:19] ottomata: do the "quick" bounce instructions [20:55:22] that should be ok [20:55:26] meh, [20:55:29] I'll get it later [20:55:31] its ok [20:55:33] I don't need them right now [20:56:26] oh! 1007 is reporting down in ganglia [20:56:32] and 1001 is gone [20:56:51] ok, i'll let you get it later [20:56:55] looking into 1007 and 1001 ganglia now [20:58:02] thanks! [21:00:39] !log Jenkins renamed mw-jenkinsbot irc bot to wmf-insecte (french for "bug"). Updated IRC conf to point to chat.freenode.net:7000 with SSL. [21:00:45] Logged the message, Master [21:09:05] ok manybubbles, fixed, it was a weird issue where the sysctl properties weren't reloaded properly [21:09:10] ganglia was unhappy about that [21:35:11] ^d, do you have any thoughts on why gerrit ssh would not agree with gerrit https (I'm missing the last commit of the wmf-deploy branch or ssh://mwalker@gerrit.wikimedia.org:29418/mediawiki/services/ocg-collection.git when compared to https://gerrit.wikimedia.org/r/mediawiki/services/ocg-collection/deploy/.git wmf-deploy branch) [21:35:28] oh; never mind [21:35:37] I'm looking at different repos [21:35:42] <^d> :) [21:35:46] <^d> Glad I could help! 
[21:35:57] you were a very helpful duck :) [21:44:25] ori, gwicke; if you have a moment I'm having some conceptual difficulties with git deploy I think -- the group wikidev should have rw on /srv/deployment/ ? [21:44:52] mwalker, normally that's owned by root [21:45:08] on tin or the deployment target(s)? [21:45:13] * for context I'm trying to get ocg/ocg deployed to tantalum [21:45:14] git-deploy is basically a thin wrapper around sudoed salt calls [21:45:16] this is on tin [21:45:33] oh, on tin it should be owned by tin [21:45:38] eh, wikidev [21:45:43] * gwicke needs more caffeine [21:46:08] heh; so right now ocg/ocg is owned by trebuchet/wikidev [21:46:09] mwalker, are you using submodules? [21:46:15] but wikidev only has r, not rw [21:46:19] and yes; I'm using submodules [21:46:34] there might be bugs lurking in that area [21:46:38] that sounds like a plain' ole bug, yeah [21:46:54] at a minimum you'll have to manually git submodule update --init [21:46:55] * ori digs through the source to see if any culprits jump out [21:47:06] gwicke, yep; I had to do that [21:47:35] ori, can you give wikidev rw on ocg/ocg in the interim? [21:47:46] (I dont know if you're a root) [21:49:17] sure [21:51:04] mwalker: try now [21:51:29] yay! it works again [21:51:31] thanks ori! [21:52:26] mwalker, did the deploy work? [21:52:42] well; the checkout worked at least [21:52:54] and; before when it was trying to deploy the wrong repo it worked [21:53:09] (I deleted the directory and had puppet recreate it; which is how I lost my permissions) [21:53:15] step by step.. [21:55:11] mwalker: filed ; can you add any relevant details? 
sure [22:00:26] ori, ah crap; I really screwed myself; not only did puppet not give wikidev the right permissions after I; it also failed to set the deploy.* git configuration variables [22:00:41] guess I should've done the more hacky change to the repo origin [22:00:46] *shrugs* [22:01:31] mwalker, don't forget checkout-submodules = true [22:08:32] ^d, now I can use your help though... I need to clean up the mess I left in gerrit whilst trying to figure out where everything should go! can you delete the "operations/ocg-config.git" and "mediawiki/services/ocg-collection/deploy.git" repos? [22:09:45] ^d, it would also be wonderful to actually be a member of the mediawiki-services-ocg-collection group :p [22:13:11] <^d> done, done and done [22:14:40] yay! thanks kindly [22:15:13] mwalker: do you need me to do anything? [22:16:00] I dont think so; everything that should be seems to be on tantalum; it's not running; but I think I know why and can fix that myself [22:26:42] cool [23:16:14] (PS18) BryanDavis: Configure scap master and clients in beta [operations/puppet] - https://gerrit.wikimedia.org/r/123674 [23:16:29] (PS19) BryanDavis: Configure scap master and clients in beta [operations/puppet] - https://gerrit.wikimedia.org/r/123674 [23:18:50] (CR) BryanDavis: "This has been cherry-picked into the beta puppet master and used to get the core scap process working there." [operations/puppet] - https://gerrit.wikimedia.org/r/123674 (owner: BryanDavis) [23:22:52] (CR) PiRSquared17: [C: -1] New namespace aliases for lt test project on BetaWV (7 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127278 (owner: Gerrit Patch Uploader) [23:34:18] (PS3) Gerrit Patch Uploader: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127278 [23:34:20] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127278 (owner: Gerrit Patch Uploader) [23:35:20] (CR) John F. Lewis: [C: 1] New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127278 (owner: Gerrit Patch Uploader) [23:37:07] (CR) PiRSquared17: [C: 1] New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127278 (owner: Gerrit Patch Uploader) [23:44:41] (CR) Quiddity: "Where is class=noexcerpt used currently?! I can't find it in the common/vector CSS (in core or on enwiki), nor mentioned anywhere on mw.or" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/126226 (owner: Prtksxna)