[00:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T0000).
[00:02:01] 	 PROBLEM - HHVM rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:03:19] 	 RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 75116 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:12:11] * paladox here.
[00:14:42] 	 Operations, ops-codfw, media-storage, observability, User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (faidon) I'm not at all sure, but I don't see an LD 5 at all. Is it possible that instead of remaining as a degraded LD (with a failed disk...
[00:17:48] 	 (PS3) Dzahn: phabricator: activate read-only mode for maintenance [puppet] - https://gerrit.wikimedia.org/r/511929 (https://phabricator.wikimedia.org/T196019)
[00:26:37] 	 PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:27:53] 	 RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 75080 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:33:34] 	 Operations, PHP 7.2 support, Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (Krinkle) @Joe The cache hit ratio and rate per second aren't metrics we previously monitored for HHVM. In retrospect, I suppose that would've been useful. I was hoping we'd h...
[00:37:33] 	 Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Krinkle)
[00:38:13] 	 Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Krinkle)
[00:38:21] 	 Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[00:38:30] 	 (PS1) Dzahn: phabricator: rsync /srv/repos from 1001 to 1003 [puppet] - https://gerrit.wikimedia.org/r/512077 (https://phabricator.wikimedia.org/T221389)
[00:38:37] 	 Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Krinkle) > for example https://en.wikipedia.org/w/index.php?tit...
[00:40:02] 	 (PS2) Dzahn: phabricator: rsync /srv/repos from 1001 to 1003 [puppet] - https://gerrit.wikimedia.org/r/512077 (https://phabricator.wikimedia.org/T221389)
[00:40:47] 	 (CR) Dzahn: [C: +2] phabricator: rsync /srv/repos from 1001 to 1003 [puppet] - https://gerrit.wikimedia.org/r/512077 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:42:11] 	 !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@e040c6c]: Deploy GUI update
[00:42:15] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:53] 	 PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:50] 	 (PS1) Dzahn: Revert "phabricator: rsync /srv/repos from 1001 to 1003" [puppet] - https://gerrit.wikimedia.org/r/512078
[00:45:02] 	 !log phab1003 - rsyncing /srv/repos from phab1001
[00:45:05] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:09] 	 RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 75080 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:52:05] 	 !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@e040c6c]: Deploy GUI update (duration: 09m 54s)
[00:52:08] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:33] 	 !log extended phab downtime in icinga, actual downtime hasn't started yet, prep work taking longer than expected
[01:09:36] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:17] 	 (PS4) Dzahn: phabricator: activate read-only mode for maintenance [puppet] - https://gerrit.wikimedia.org/r/511929 (https://phabricator.wikimedia.org/T196019)
[01:17:58] 	 (CR) Dzahn: [C: +2] phabricator: activate read-only mode for maintenance [puppet] - https://gerrit.wikimedia.org/r/511929 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[01:18:23] 	 !log phabricator going readonly momentarily
[01:18:26] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:13] 	 (PS1) Dzahn: Revert "phabricator: activate read-only mode for maintenance" [puppet] - https://gerrit.wikimedia.org/r/512079
[01:20:49] 	 (PS3) Dzahn: phabricator: Mediawiki -> Wikimedia and fix a typo [puppet] - https://gerrit.wikimedia.org/r/511926
[01:22:09] 	 (CR) Dzahn: [C: +2] phabricator: Mediawiki -> Wikimedia and fix a typo [puppet] - https://gerrit.wikimedia.org/r/511926 (owner: Dzahn)
[01:22:54] 	 (PS10) Dzahn: switch phabricator from phab1001 to phab1003 [puppet] - https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019)
[01:27:45] 	 (CR) Dzahn: [C: +2] switch phabricator from phab1001 to phab1003 [puppet] - https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[01:28:45] 	 !log stopping phd on phab1001
[01:28:49] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:50] 	 !log switched from phab1001 to phab1003 - applied on cp1008 varnish canary first
[01:30:53] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:11] 	 !log run puppet on mx1001/mx2001 - switch mail route for phab to phab1003
[01:33:13] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:48] 	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab1001-vcs.eqiad.wmnet
[01:36:51] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:22] 	 !log depooled phab1001-vcs from git-ssh via conftool
[01:37:25] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:59] 	 PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:39:57] 	 PROBLEM - PyBal IPVS diff check on lvs1002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:41:41] 	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:41:51] 	 PROBLEM - PyBal IPVS diff check on lvs1005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:43:35] 	 PROBLEM - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:48:44] 	 ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 20after4 scheduled maint
[01:48:44] 	 ACKNOWLEDGEMENT - puppet last run on phab1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[phd] 20after4 scheduled maint
[01:48:44] 	 ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 20after4 scheduled maint
[01:49:30] 	 !log puppetmaster1001 - conftool-merge
[01:49:33] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:09] 	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab1003-vcs.eqiad.wmnet
[01:52:10] 	 !log phabricator is now served by phab1003 though still in read-only mode for a bit longer
[01:52:12] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:15] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:43] 	 'Read-only mode was enabled by the explicit action of a human administrator, so you can get more information about why it has been turned on by rolling your chair away from your desk and yelling "Hey! Why is Phabricator in read-only mode??!" using your very loudest outside voice.'
[01:58:27] 	 PROBLEM - PHD should be running on phab1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:59:24] 	 ACKNOWLEDGEMENT - PHD should be running on phab1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 497 (phd) daniel_zahn migration from phab1001 https://wikitech.wikimedia.org/wiki/Phabricator
[01:59:25] 	 ACKNOWLEDGEMENT - PHD should be supervising processes on phab1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 497 (phd) daniel_zahn migration from phab1001 https://wikitech.wikimedia.org/wiki/Phabricator
[02:00:38] 	 mutante: what happened?
[02:00:47] 	 ah
[02:00:51] 	 i see :)
[02:00:58] 	 chaomodus: failed to schedule the downtime on the new host before the switch, sorry
[02:01:03] 	 for that one service phd
[02:01:08] 	 No worries just making sure since it paged :)
[02:01:23] 	 yea, i noticed and tried to send the ACK asap
[02:01:33] 	 right that's that upgrade that was previously alluded to
[02:01:49] 	 i have a different minor issue though
[02:02:00] 	 merged change in conftool and ran conftool-merge
[02:02:10] 	 but somehow it did not get the new config yet
[02:02:19] 	 doesn't conftool use etcd?
[02:02:58] 	 afaik yes
[02:03:03] 	 yes
[02:03:24] 	 so i saw this in the output of conftool-merge:
[02:03:30] 	 Creating node with tags eqiad/phabricator/git-ssh/phab1003-vcs.eqiad.wmnet
[02:03:34] 	 which is what we want
[02:03:52] 	 oh.. maybe it is just the lack of sudo -i again
[02:04:02] 	 Backend error: The request requires user authentication : Insufficient credentials
[02:04:08] 	 https://wikitech.wikimedia.org/wiki/Conftool#Insufficient_credentials
[02:04:09] 	 Ahh yes
[02:04:37] 	 !log puppetmaster1001 - sudo -i conftool-merge
[02:04:40] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:45] 	 2019-05-23 02:04:27 [INFO] conftool::cleanup: Removing node with tags eqiad/phabricator/git-ssh/phab1001-vcs.eqiad.wmnet
[02:04:50] 	 2019-05-23 02:04:27 [INFO] conftool::load: Creating node with tags eqiad/phabricator/git-ssh/phab1003-vcs.eqiad.wmnet
[02:04:54] 	 looks better
[02:05:03] 	 i hope this will soon fix the pybal alerts
[02:05:22] 	 also.. that "phd not running" isn't even that important.. it should not page us
[02:05:28] 	 right twentyafterfour
[02:05:29] 	 Yes
[02:05:38] 	 Er, actually
[02:05:48] 	 didn't that get turned into critical after some need for it to page?
[02:06:05] 	 there was something about it but it's been a while.. yea
[02:06:27] 	 i like that i now see something for
[02:06:28] 	 [cumin1001:~] $ sudo -i confctl select name=phab1003-vcs.eqiad.wmnet get
[02:06:35] 	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab1003-vcs.eqiad.wmnet
[02:06:38] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:41] 	 right phd should page me but not all of SRE ;)
[02:07:00] 	 we should make a ticket for that ... once phab is up, heh
[02:07:05] 	 k
[02:07:33] 	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:07:33] 	 twentyafterfour: try git-ssh again?
[02:07:34] 	 we can probably enable writes in phab now
[02:07:38] 	 yay RECOVERY :)
[02:07:41] 	 RECOVERY - PyBal IPVS diff check on lvs1005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:07:42] 	 ok I'll try git-ssh
[02:07:44] 	 phew
[02:07:49] 	 that freaked me out a bit
[02:08:26] 	 rescheduling the rest of the checks
[02:08:47] 	 RECOVERY - PyBal IPVS diff check on lvs1002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:08:47] 	 RECOVERY - PyBal IPVS diff check on lvs1014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:09:06] 	 also.. STILL rsyncing /srv/repos :P
[02:09:20] 	 should have done a pre-run to speed it up
[02:09:35] 	 mutante: can't test git-ssh, repos are still syncing
[02:09:42] 	 oh, right ok
[02:09:44] 	 but it's a non-essential service
[02:09:49] 	 good!
[02:10:17] 	 I think we should be good to revert the read-only, as far as I can see
[02:10:45] 	 Krenair: this is https://lists.wikimedia.org/pipermail/wikitech-l/2019-May/092089.html
[02:11:07] 	 we just took longer than planned
[02:11:20] 	 yeah it's no problem
[02:11:30] 	 I just enjoyed the read-only message from phab
[02:11:36] 	 heh, ok :)
[02:11:42] * twentyafterfour loves phabricator humor
[02:12:31] 	 I used my loudest possible voice but i got no answer.
[02:12:58] 	 the old server stopped listening :)
[02:13:44] 	 cool, everything looks good to me
[02:13:52] 	 twentyafterfour: anything else to test? we are literally just waiting for rsync now?
[02:13:54] 	 still rsyncing but getting close
[02:13:56] 	 ah, great
[02:14:29] 	 tests bugzilla.wikimedia.org :PP
[02:14:33] 	 41G of 47G
[02:14:57] 	 git.wikimedia.org redirects
[02:15:16] 	 bugzilla does too.. and phab1001-vcs is gone from confctl
[02:15:56] 	 they still both have the LVS IP.. but that's on "lo" and was also the case before
[02:16:24] 	 oh we need to check the phab realtime notifications
[02:17:27] 	 it shows "Disconnected"
[02:17:57] 	 phab1001 is in the list of "phabricator_servers" in hiera. but that does NOT mean it's the "active_server"... just "a" server
[02:18:27] 	 and that is used for firewall rules
[02:18:30] 	 and that's it
[02:19:00] 	 sudo systemctl start aphlict
[02:19:02] 	 Failed to start aphlict.service: Unit aphlict.service not found.
[02:19:11] 	 uhhh
[02:19:19] 	 how's that possible
[02:19:27] 	 is phab1003 the active_server yet?
[02:19:34] 	 yes, it is
[02:19:46] 	 hmm, maybe the unit isn't called aphlict
[02:20:23] 	 yea. it's also not found on 1001 :p
[02:20:27] 	 it's called aphlict twentyafterfour
[02:20:32] 	 https://github.com/wikimedia/puppet/blob/12bdb6c9c3f213d76aa90c6a0fceb1a6848d7a54/modules/phabricator/manifests/aphlict.pp#L83
[02:20:40] 	 ● aphlict.service not-found failed failed aphlict.service
[02:20:46] 	 ^ this is the OLD server though
[02:21:32] 	 systemd::service { 'aphlict': ... ??? wut
[02:22:14] 	 there is not even a difference in the puppet code
[02:22:20] 	 no "if jessie" or anything like it
[02:23:23] 	 Boolean $aphlict_enabled = hiera('phabricator_aphlict_enabled', false),
[02:23:33] 	 sudo -u aphlict /srv/phab/phabricator/bin/aphlict start
[02:23:35] 	 Reading configuration from: phabricator/conf/aphlict/aphlict.default.json
[02:23:37] 	 Writing logs to: /var/log/aphlict.log
[02:23:37] 	 role/eqiad/phabricator.yaml:phabricator_aphlict_enabled: true
[02:23:39] 	 Aphlict Server started.
[02:23:44] 	 wtf ?
[02:24:06] 	 how does it have a puppetized systemd unit that just doesn't exist anywhere
[02:24:14] 	 I don't know
[02:24:21] 	 the service started manually ok
[02:24:28] 	 !log manually started aphlict on phab1003
[02:24:31] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:34] 	 i am still glad you got it running of course, yay!
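Judging by its output, the PyBal IPVS diff check that fired above compares the services PyBal has configured with the services actually present in the IPVS table, and goes CRITICAL on any difference. A minimal sketch of that comparison, assuming both sides are available as sets of "address:port" strings; the function names and the second CRITICAL message are hypothetical and not taken from the real check plugin:

```python
from typing import Set, Tuple


def ipvs_diff(pybal: Set[str], ipvs: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (services PyBal knows but IPVS lacks, services in IPVS that PyBal does not know)."""
    return pybal - ipvs, ipvs - pybal


def check_status(pybal: Set[str], ipvs: Set[str]) -> str:
    """Hypothetical Icinga-style status line built from the two set differences."""
    only_pybal, only_ipvs = ipvs_diff(pybal, ipvs)
    if not only_pybal and not only_ipvs:
        return "OK: no difference between hosts in IPVS/PyBal"
    parts = []
    if only_pybal:
        parts.append(f"Services known to PyBal but not to IPVS: {sorted(only_pybal)}")
    if only_ipvs:
        # Wording of this branch is an assumption; the log only shows the other direction.
        parts.append(f"Services known to IPVS but not to PyBal: {sorted(only_ipvs)}")
    return "CRITICAL: " + "; ".join(parts)


if __name__ == "__main__":
    # The git-ssh service addresses from the alerts in this log.
    git_ssh = {"208.80.154.250:22", "2620:0:861:ed1a::3:16:22"}
    print(check_status(git_ssh, set()))    # CRITICAL while IPVS lacks the service
    print(check_status(git_ssh, git_ssh))  # OK once both sides agree
```

Checking both directions matters: a service that only IPVS still carries is as much a drift as one PyBal knows about but never installed.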
[02:25:16] 	 looks like rsync is done
[02:25:23] 	 yes, confirmed!
[02:25:31] 	 second ticket we should make is the aphlict thing
[02:25:48] 	 let me revert the rsync temp change now
[02:26:00] 	 actually, running it twice.. but that should be seconds
[02:26:49] 	 (PS2) Dzahn: Revert "phabricator: rsync /srv/repos from 1001 to 1003" [puppet] - https://gerrit.wikimedia.org/r/512078
[02:27:07] 	 paladox: does it also not show disconnected anymore? :)
[02:27:14] * paladox checks
[02:27:17] 	 it's connected for me
[02:27:23] 	 rsync finished a second time
[02:27:25] 	 it shows connected for me
[02:27:34] 	 (CR) Dzahn: [C: +2] "rsync is done now" [puppet] - https://gerrit.wikimedia.org/r/512078 (owner: Dzahn)
[02:27:38] 	 great
[02:28:03] 	 performance also feels much better! Pages loading almost instantly.
[02:28:18] 	 hopefully not just because nobody is using it, heh
[02:28:43] 	 heh
[02:30:04] 	 twentyafterfour: do you want the forensic logging too?
[02:30:09] 	 on the new host that is
[02:30:14] 	 let's remove the readonly...
[02:30:19] 	 also... a bunch of puppet warnings
[02:30:25] 	 the forensic logging might not be needed but it could be useful someday
[02:30:27] 	 related to aphlict
[02:30:41] 	 did we overlook something in puppet after all...
[02:30:54] 	 but if php7 fixes the apache workers leaking then we may not need that
[02:31:02] 	 hmmm /me looks at puppet.log
[02:31:08] 	 no, wait.. also phd failed to start
[02:31:17] 	 puppet tried to start it but it did not work
[02:31:50] 	 Unable to establish a write-mode connection (to application database "phabricator_worker") because Phabricator is in read-only mode
[02:31:52] 	 mutante: it's because it's read-only
[02:31:53] 	 ok, haha
[02:31:56] 	 it can't run phd in read-only
[02:32:02] 	 like in codfw. ack
[02:32:04] 	 that's the entire problem
[02:32:14] 	 because without phd it won't try to install aphlict
[02:32:39] 	 indeed. Phabricator::Aphlict/File[/srv/phab/phabricator//support/aphlict/server/node_modules]: Dependency Service[phd] has failures: true
[02:32:41] 	 I should probably remove that dependency from puppet because it's not really necessary to have phd for aphlict
[02:32:47] 	 yes, that :)
[02:33:06] 	 well.. ok then.. let's do it
[02:33:11] 	 go for it
[02:33:27] 	 i like how it always goes back and forth between "horribly broken" and "ah, nevermind, ok" :))
[02:33:36] 	 :)
[02:33:40] 	 (PS2) Dzahn: Revert "phabricator: activate read-only mode for maintenance" [puppet] - https://gerrit.wikimedia.org/r/512079
[02:33:55] 	 (CR) 20after4: [C: +1] Revert "phabricator: activate read-only mode for maintenance" [puppet] - https://gerrit.wikimedia.org/r/512079 (owner: Dzahn)
[02:34:18] 	 twentyafterfour: eh.. we could do this only for phab1003 though
[02:34:23] 	 and leave 1001 in readonly
[02:34:46] 	 or just disable puppet there
[02:35:22] 	 disabled puppet on 1001
[02:35:36] 	 (CR) Dzahn: [C: +2] Revert "phabricator: activate read-only mode for maintenance" [puppet] - https://gerrit.wikimedia.org/r/512079 (owner: Dzahn)
[02:35:38] 	 !log phabricator - going read-write again
[02:35:41] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:48] 	 Notice: /Stage[main]/Profile::Phabricator::Main/Php::Extension[mysqlnd]/Package[php7.2-mysqlnd]/ensure: created
[02:36:53] 	 Notice: /Stage[main]/Phabricator::Aphlict/File[/var/run/aphlict/]/ensure: created
[02:37:02] 	 Notice: /Stage[main]/Phabricator::Aphlict/Systemd::Service[aphlict]/Service[aphlict]/ensure: ensure changed 'stopped' to 'running'
[02:37:06] 	 Notice: Applied catalog in 22.02 seconds
[02:37:10] 	 no more puppet issues
[02:37:28] 	 paladox: yes, it DOES feel faster :)
[02:37:34] 	 :)
[02:37:53] 	 RECOVERY - PHD should be running on phab1003 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:38:01] 	 yay
[02:38:05] 	 there, first write action was to resolve the ticket to set up 1003
[02:38:09] 	 Operations, hardware-requests, serviceops: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (Dzahn)
[02:38:11] 	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (Dzahn) Open→Resolved
[02:38:13] 	 unfortunately this created yet another page :P
[02:38:28] 	 but there we go
[02:39:47] 	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (Dzahn) 2019-05-23 02:35 mutante: phabricator - going read-write again 02:24 twentyafterfour: manually started aphlict on phab1003 02:06 dzahn@cumin10...
[02:39:56] 	 Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Anomie) >>! In T219279#5207043, @Krinkle wrote: > This now work...
[02:39:57] 	 now let's not forget this was only to be able to reinstall phab1001 .. 
[02:40:21] 	 kind of wish that would be fully done with this.. maybe we can still do that
[02:41:09] 	 ! downtimed the systemd state on phab1001 for 1 year 
[02:41:17] 	 !log downtimed the systemd state on phab1001 for 1 year 
[02:41:21] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:44] 	 also our external Catchpoint monitoring had noticed phab was down.. and then that it came back
[02:42:44] 	 hmm so there was actual downtime not just read-only? 
[02:43:51] 	 yea.. ALERT! Phabricator: Operation timed out after 10000 milliseconds with 0 bytes received (Timeout was reached)
[02:44:35] 	 and then " https service on
[02:44:38] 	 'phabricator.wikimedia.org' has been working again"
[02:44:47] 	 hmm https://phabricator.wikimedia.org/diffusion/APIOS/manage/
[02:44:48] 	 but that's not what we saw 
[02:44:52] 	 shows "Pull of 'rAPIOS' failed: Command failed with error #128! COMMAND git remote set-url origin -- '********' STDOUT (empty) STDERR error: could not lock config file config: Permission denied fatal: could not set 'remote.origin.url' to 'https://github.com/wikimedia/wikipedia-ios.git'"
[02:45:46] 	 eh.. wouldn't that be a local issue?
[02:47:09] 	 twentyafterfour: the owner of /srv/repos is "vcs" instead of "phd"
[02:47:13] 	 on phab, yup (i guess chown /srv/repos with phd:www-data)
[02:47:52] 	 i hate the root cause of this.. which is that users never have the same UID globally
[02:48:13] 	 once tried to standardize them on wikitech page "UID"
[02:48:31] 	 uid=498(vcs) gid=498(phd) groups=498(phd)
[02:48:36] 	 uid=498(phd) gid=498(phd) groups=498(phd)
[02:49:55] 	 they are exactly switched around 
[02:50:10] 	 heh
[02:50:15] 	 and it's just about which gets created first on the puppet run 
[02:51:34] 	 !log phab1003 - chown -R phd /srv/repos/
[02:51:37] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:42] 	 yeah and puppet is non-deterministic 
[02:51:58] 	 and rsync doesn't map usernames 
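The ownership mix-up above comes down to numeric uids: according to the log, rsync did not map user names, so files owned by uid 498 (`phd` on phab1001) arrived on phab1003, where uid 498 is `vcs`. A small sketch of translating a uid between hosts by user name instead. The uid tables are reconstructed from this log, reading the two `id` lines as phab1003 and phab1001 respectively and taking phd=497 on phab1003 from the PHD check; vcs=497 on phab1001 is only inferred from "exactly switched around". The function and table names are hypothetical:

```python
from typing import Dict

# uid -> user name per host, reconstructed from this log (assumptions, see lead-in).
PHAB1001_USERS: Dict[int, str] = {498: "phd", 497: "vcs"}  # vcs=497 is inferred, not shown
PHAB1003_USERS: Dict[int, str] = {497: "phd", 498: "vcs"}


def translate_uid(uid: int, src_users: Dict[int, str], dst_users: Dict[int, str]) -> int:
    """Return the destination uid that has the same user *name* as `uid` on the source host."""
    name = src_users[uid]
    dst_by_name = {user: num for num, user in dst_users.items()}
    return dst_by_name[name]


if __name__ == "__main__":
    # Copying the number verbatim: files owned by phd (498) on phab1001 show up as vcs on phab1003.
    print(PHAB1003_USERS[498])                                 # vcs
    # Mapping by name instead gives the uid that phd actually has on phab1003.
    print(translate_uid(498, PHAB1001_USERS, PHAB1003_USERS))  # 497
```

The fix actually applied in the log was simpler: a one-off `chown -R phd /srv/repos/` on phab1003 after the copy.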
[02:52:08] 	 paladox: how about now?
[02:52:11] 	 PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:56:07] 	 !log phab1001 - removing community_metrics and project_changes cron jobs to avoid duplicate mails
[02:56:10] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:33] * paladox checks
[02:58:12] 	 Could someone press the update button on https://phabricator.wikimedia.org/diffusion/APIOS/manage/ please?
[02:58:17] 	 I’m mobile
[02:59:01] 	 Works!
[02:59:08] 	 i am not allowed to :) but nice
[03:03:11] 	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:03:53] 	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:04:09] 	 grmbl
[03:04:58] 	 Telia again
[03:05:42] 	 Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) Resolved→Open and..it is DOWN again   23:03 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, ex...
[03:06:34] 	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:06:34] 	 ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:09:10] 	 - Maintenance window:
[03:09:12] 	 Start Date and Time: 2019-May-23 03:00 UTC 
[03:09:34] 	 well i guess that matches perfectly
[03:10:24] 	 Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) - Maintenance window: Start Date and Time: 2019-May-23 03:00 UTC  End Date and Time: 2019-May-23 07:00 UTC   Action and Reason: Emergency hardware work needed to restore traffic. We will rese...
[03:10:39] 	 ok! i think we are good
[03:11:44] 	 twentyafterfour, bug report in -dev
[03:11:59] 	 <shreyasminocha> Call to undefined function ldap_connect()
[03:12:18] 	  occurs when i try to login via ldap
[03:12:58] 	 ii  php7.2-ldap 
[03:13:11] 	 SIGH
[03:14:55] 	 ahh
[03:16:23] 	 https://www.php.net/manual/en/function.ldap-connect.php  says it is in 7
[03:17:09] 	 LDAP support in PHP is not enabled by default. You will need to use the --with-ldap[=DIR] 
[03:17:28] 	 mutante: ^
[03:18:05] 	 " This package provides the LDAP module(s) for PHP.
[03:18:59] 	 do we need to restart php or maybe the php.ini isn't loading that module? 
[03:20:02] 	 ls /etc/php/7.2/apache2/conf.d/*ldap*
[03:20:04] 	 /etc/php/7.2/apache2/conf.d/20-ldap.ini
[03:20:06] 	 twentyafterfour@phab1003:/srv/phab$ ls /etc/php/7.2/fpm/conf.d/*ldap*
[03:20:08] 	 ls: cannot access '/etc/php/7.2/fpm/conf.d/*ldap*': No such file or directory
[03:20:10] 	 twentyafterfour@phab1003:/srv/phab$ 
[03:20:14] 	 we have ldap.ini for apache2 but not for fpm
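The `ls` comparison above (apache2/conf.d has 20-ldap.ini, fpm/conf.d does not) generalizes to comparing the enabled-module lists of two PHP SAPIs. A sketch assuming the Debian layout seen in this log, /etc/php/<version>/<sapi>/conf.d/<priority>-<module>.ini; the function names and the mysqlnd file in the demo tree are hypothetical:

```python
import tempfile
from pathlib import Path
from typing import Set


def enabled_modules(etc_php: Path, version: str, sapi: str) -> Set[str]:
    """Module names enabled for one SAPI, e.g. conf.d/20-ldap.ini -> "ldap"."""
    conf_d = etc_php / version / sapi / "conf.d"
    if not conf_d.is_dir():
        return set()
    names = set()
    for entry in conf_d.glob("*.ini"):
        prefix, sep, rest = entry.stem.partition("-")
        # Strip a numeric priority prefix such as "20-" when present.
        names.add(rest if sep and prefix.isdigit() else entry.stem)
    return names


def missing_modules(etc_php: Path, version: str, reference: str, target: str) -> Set[str]:
    """Modules enabled for the `reference` SAPI but not for `target`."""
    return enabled_modules(etc_php, version, reference) - enabled_modules(etc_php, version, target)


if __name__ == "__main__":
    # Hypothetical scratch tree standing in for /etc/php on phab1003.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        layout = {"apache2": ["20-ldap.ini", "20-mysqlnd.ini"], "fpm": ["20-mysqlnd.ini"]}
        for sapi, files in layout.items():
            conf_d = root / "7.2" / sapi / "conf.d"
            conf_d.mkdir(parents=True)
            for name in files:
                (conf_d / name).write_text("")
        print(missing_modules(root, "7.2", "apache2", "fpm"))  # {'ldap'}
```

Pointed at the real /etc/php with reference "apache2" and target "fpm", this would list every module that, like ldap here, was only enabled for the Apache SAPI.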
[03:20:58] 	 mutante: ^ do you know where in puppet that conf.d is configured? seems it needs to be fpm aware 
[03:21:35] 	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:21:39] 	 I can probably figure it out if you need to take a break 
[03:21:49] 	 but I'll need someone to merge the patch once I finish it 
[03:22:00] 	 manually copying that file temporarily 
[03:22:03] 	 twentyafterfour: i created a manual link first
[03:22:11] 	 wait
[03:22:13] 	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:22:24] 	 it's just a symlink to mods-available
[03:22:27] 	 and i did it
[03:22:57] 	 phd failed to start again
[03:23:31] 	 twentyafterfour: looks like you copied the file over the symlink. does it work though?
[03:23:42] 	 it should be a link 
[03:24:22] 	 mutante: fixed it
[03:24:42] 	 :)
[03:25:31] 	 twentyafterfour: php::extension{}
[03:25:41] 	 ?
[03:25:57] 	 oh
[03:26:03] * twentyafterfour will work on a patch
[03:26:12] 	 that's kind of new
[03:26:39] 	  modules/profile/manifests/mediawiki/php.pp uses it
[03:27:07] 	 !log restarted php-fpm on phab1003
[03:27:08] 	 sapis => ['fpm']
[03:27:10] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:28] 	 I think you add ldap to https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/mediawiki/php.pp#L137
[03:33:11] 	 that's the mediawiki profile though
[03:33:49] 	 Oh wrong class
[03:34:21] 	 RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:34:38] 	 Here https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/phabricator/main.pp#L264
[03:34:49] 	 paladox: yea, similar code but in another place.. right
[03:34:59] 	 also i like to see that recovery on the slave
[03:35:07] 	 i would like to go eat then now..
[03:35:16] 	 looks like we got the things under control
[03:35:36] 	 twentyafterfour: maybe just check that puppet does not remove that file and phd starts again?
[03:36:47] 	 Puppet will remove the file I think
[03:36:52] 	 Notice: /Stage[main]/Php/File[/etc/php/7.2/fpm/conf.d/20-ldap.ini]/ensure: removed
[03:36:55] 	 yes :P
[03:36:56] 	 Yup
[03:39:02] 	 !log phab1003 - disabling puppet; /etc/php/7.2/fpm/conf.d# ln -s /etc/php/7.2/mods-available/ldap.ini 20-ldap.ini ; systemctl restart php7.2-fpm
[03:39:05] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:39:06] 	 twentyafterfour: ^
[03:39:10] 	 right?
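The manual fix logged above links mods-available/ldap.ini into fpm/conf.d as 20-ldap.ini and then restarts php7.2-fpm. A Python sketch of just the linking step, assuming the same /etc/php/<version> layout; the function name is hypothetical. It also replaces a plain copied file with a proper link, which is what had to be corrected at 03:23. As the 03:36 puppet run showed, Puppet removes unmanaged files in conf.d, so the durable fix is the php::extension change with sapis => ['fpm'], and PHP-FPM still has to be restarted before the module is loaded:

```python
import tempfile
from pathlib import Path


def enable_php_module(etc_php: Path, version: str, sapi: str, module: str, priority: int = 20) -> Path:
    """Symlink mods-available/<module>.ini into <sapi>/conf.d/<priority>-<module>.ini."""
    target = etc_php / version / "mods-available" / f"{module}.ini"
    if not target.is_file():
        raise FileNotFoundError(target)
    conf_d = etc_php / version / sapi / "conf.d"
    conf_d.mkdir(parents=True, exist_ok=True)
    link = conf_d / f"{priority}-{module}.ini"
    # Replace whatever is there (a stale link or a plain copied file) with a fresh link.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(target)
    return link


if __name__ == "__main__":
    # Scratch tree standing in for /etc/php; nothing outside the temp dir is touched.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        available = root / "7.2" / "mods-available"
        available.mkdir(parents=True)
        (available / "ldap.ini").write_text("extension=ldap.so\n")
        link = enable_php_module(root, "7.2", "fpm", "ldap")
        print(link.name, "->", link.resolve().name)  # 20-ldap.ini -> ldap.ini
```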
[03:40:20] 	 (PS1) 20after4: phab: include php-ldap in the php-fpm config [puppet] - https://gerrit.wikimedia.org/r/512080 (https://phabricator.wikimedia.org/T151070)
[03:41:30] 	 (CR) Dzahn: [C: +2] phab: include php-ldap in the php-fpm config [puppet] - https://gerrit.wikimedia.org/r/512080 (https://phabricator.wikimedia.org/T151070) (owner: 20after4)
[03:42:43] 	 -extension=ldap.so
[03:42:43] 	 +extension = ldap.so
[03:43:08] 	 :Extension[ldap]/File[/etc/php/7.2/mods-available/ldap.ini]/content
[03:43:14] 	 :)
[03:43:17] 	 Notice: /Stage[main]/Profile::Phabricator::Main/Php::Extension[ldap]/File[/etc/php/7.2/fpm/conf.d/20-ldap.ini]/ensure: created
[03:43:22] 	 Notice: /Stage[main]/Profile::Phabricator::Main/Php::Extension[ldap]/File[/etc/php/7.2/mods-available/ldap.ini]/mode: mode changed '0644' to '0444'
[03:43:28] 	 Php needs restarting for things to take effect
[03:43:54] 	 ok, restart done
[03:44:00] 	 :)
[03:44:23] 	 ldap login works 
[03:44:34] 	 (PS2) Andrew Bogott: nova: make all services active/active [puppet] - https://gerrit.wikimedia.org/r/511950 (https://phabricator.wikimedia.org/T223905)
[03:44:36] 	 (PS1) Andrew Bogott: designate: make designate nodes active/active [puppet] - https://gerrit.wikimedia.org/r/512081 (https://phabricator.wikimedia.org/T223905)
[03:45:19] 	 twentyafterfour: great! :)  ok, i will go eat now.. but won't be far away and have the phone
[03:47:46] 	 thanks man! 
[03:48:00] 	 !log puppet runs cleanly on phab1003 
[03:48:03] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:14] 	 Notice: Applied catalog in 19.25 seconds
[03:50:11] 	 !log m3 database activity levels look like they have returned to normal 
[03:50:14] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:01] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:55:19] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:55:47] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:56:27] 	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[03:56:47] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:56:47] 	 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[03:57:55] 	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[03:58:35] 	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:59:15] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:59:21] 	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[03:59:35] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:59:37] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:00:03] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:03:29] 	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:03:35] 	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[04:03:58] 	 (PS1) 20after4: phab: Configure sapis for all php extensions [puppet] - https://gerrit.wikimedia.org/r/512082 (https://phabricator.wikimedia.org/T151070)
[04:04:57] 	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[04:05:15] 	 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[04:05:37] 	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:09:09] 	 Operations, MediaWiki-Logging, Wikimedia-Logstash, wmerrors, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (tstarling) Basic porting work on wmerrors is hopefully complete.  It still writes a text-based log entry to a socket. I suppose JS...
[04:24:49] 	 !log start nl, pt, pl wiki dumps to fill the new parsoid tables - T215956
[04:24:57] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:57] 	 T215956: Consider stashing data-parsoid for VE  - https://phabricator.wikimedia.org/T215956
[04:34:46] 	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512083
[04:37:42] 	 (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512083 (owner: Marostegui)
[04:38:41] 	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512083 (owner: Marostegui)
[04:38:56] 	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512083 (owner: Marostegui)
[04:40:11] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080 (duration: 00m 58s)
[04:40:15] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:58] 	 (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512084
[04:43:13] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512084 (owner: Marostegui)
[04:44:56] 	 (Merged) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512084 (owner: Marostegui)
[04:45:23] 	 (PS1) Marostegui: db1136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/512086 (https://phabricator.wikimedia.org/T222682)
[04:46:32] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080 (duration: 00m 55s)
[04:46:35] 	 (CR) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512084 (owner: Marostegui)
[04:46:36] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:20] 	 (CR) Marostegui: [C: +2] db1136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/512086 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[04:50:40] 	 Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) Open→Resolved 23:21 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0   23:22 <+icin...
[04:50:55] 	 Operations, Phabricator, serviceops, Patch-For-Review, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (Dzahn) Stalled→Open a:Dzahn
[04:51:07] 	 Operations, Phabricator, serviceops, Patch-For-Review, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (Dzahn)
[04:52:48] 	 (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1136 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512087 (https://phabricator.wikimedia.org/T222682)
[04:55:28] 	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (Dzahn)
[04:55:31] 	 (PS1) Dzahn: update SPF records for Phabricator to phab1003 IP [dns] - https://gerrit.wikimedia.org/r/512088 (https://phabricator.wikimedia.org/T221389)
[04:57:00] 	 (CR) Dzahn: [C: +2] update SPF records for Phabricator to phab1003 IP [dns] - https://gerrit.wikimedia.org/r/512088 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[04:57:54] 	 !log decommission restbase1009-a - T223976
[04:57:59] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:59] 	 T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[04:58:19] 	 (CR) Giuseppe Lavagetto: [C: +1] db-eqiad,db-codfw.php: Pool db1136 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512087 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[04:59:26] 	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db1136 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512087 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[04:59:37] 	 Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (Dzahn)
[05:00:14] 	 (PS1) Marostegui: mariadb: Remove db2040 for decommission [puppet] - https://gerrit.wikimedia.org/r/512089 (https://phabricator.wikimedia.org/T224079)
[05:00:24] 	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1136 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512087 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:00:45] 	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1136 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/512087 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:02:13] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1136 into s7 T222682 (duration: 00m 55s)
[05:02:17] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:17] 	 T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:03:21] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db1136 into s7 T222682 (duration: 00m 55s)
[05:03:25] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:41] 	 (CR) Marostegui: [C: +2] mariadb: Remove db2040 for decommission [puppet] - https://gerrit.wikimedia.org/r/512089 (https://phabricator.wikimedia.org/T224079) (owner: Marostegui)
[05:08:21] 	 Operations, ops-codfw, decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (Marostegui) a:Marostegui→RobH This host is fully ready for DCOPs to take over and decommission
[05:13:46] 	 (PS1) Marostegui: db-codfw.php: Depool db2107 [mediawiki-config] - https://gerrit.wikimedia.org/r/512091 (https://phabricator.wikimedia.org/T220170)
[05:15:59] 	 (PS1) Marostegui: db2107: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/512092 (https://phabricator.wikimedia.org/T220170)
[05:16:11] 	 (PS2) Marostegui: db-codfw.php: Clarify db2017 status [mediawiki-config] - https://gerrit.wikimedia.org/r/512091 (https://phabricator.wikimedia.org/T220170)
[05:16:51] 	 (CR) Marostegui: [C: +2] db2107: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/512092 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:18:08] 	 (CR) Marostegui: [C: +2] db-codfw.php: Clarify db2017 status [mediawiki-config] - https://gerrit.wikimedia.org/r/512091 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:19:08] 	 (Merged) jenkins-bot: db-codfw.php: Clarify db2017 status [mediawiki-config] - https://gerrit.wikimedia.org/r/512091 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:19:25] 	 (CR) jenkins-bot: db-codfw.php: Clarify db2017 status [mediawiki-config] - https://gerrit.wikimedia.org/r/512091 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:20:29] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify db2107 status - will be the new master (duration: 00m 54s)
[05:20:33] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:36] 	 (PS1) Marostegui: db-codfw.php: Promote db2070 to m5 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/512094 (https://phabricator.wikimedia.org/T221533)
[05:29:17] 	 !log Promote db2070 to m5 codfw master instead of db2037 - T221533
[05:29:21] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:21] 	 T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[05:31:07] 	 (CR) Marostegui: [C: +2] db-codfw.php: Promote db2070 to m5 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/512094 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:31:50] 	 (Merged) jenkins-bot: db-codfw.php: Promote db2070 to m5 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/512094 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:32:04] 	 (CR) jenkins-bot: db-codfw.php: Promote db2070 to m5 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/512094 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:33:07] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2070 as m5 codfw master - T221533 (duration: 00m 54s)
[05:33:11] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:28] 	 (PS1) Marostegui: db2070: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/512095 (https://phabricator.wikimedia.org/T221533)
[05:35:17] 	 (CR) Marostegui: [C: +2] db2070: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/512095 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:14:48] 	 !log start ruwiki dumps to fill the new parsoid tables - T215956
[06:14:54] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:54] 	 T215956: Consider stashing data-parsoid for VE  - https://phabricator.wikimedia.org/T215956
[06:25:12] 	 (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/512096 (https://phabricator.wikimedia.org/T221533)
[06:26:14] 	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/512096 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:27:13] 	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/512096 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:27:27] 	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/512096 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:28:53] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2065 from config as it will be moved to m3 to replace db2042 (duration: 00m 56s)
[06:28:56] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:27] 	 (PS1) Marostegui: db2065: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512097 (https://phabricator.wikimedia.org/T221533)
[06:29:57] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2065 from config as it will be moved to m3 to replace db2042 (duration: 00m 55s)
[06:30:00] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:12] 	 PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:30:26] 	 (PS1) Marostegui: db-eqiad.php: More weight to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512098
[06:31:00] 	 (CR) Marostegui: [C: +2] db2065: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512097 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[06:32:20] 	 (CR) KartikMistry: [C: +1] deployment-prep: Use new cxserver running in Docker [mediawiki-config] - https://gerrit.wikimedia.org/r/510586 (https://phabricator.wikimedia.org/T220235) (owner: Alex Monk)
[06:33:26] 	 PROBLEM - puppet last run on ores1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:36:42] 	 (CR) Marostegui: [C: +2] db-eqiad.php: More weight to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512098 (owner: Marostegui)
[06:37:38] 	 (Merged) jenkins-bot: db-eqiad.php: More weight to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512098 (owner: Marostegui)
[06:37:51] 	 (CR) jenkins-bot: db-eqiad.php: More weight to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512098 (owner: Marostegui)
[06:38:54] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1136 (duration: 00m 55s)
[06:38:57] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:25] 	 RECOVERY - puppet last run on ores1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:43] 	 RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:36] 	 (PS1) Marostegui: mariadb: Provision db1128 into m3 [puppet] - https://gerrit.wikimedia.org/r/512099 (https://phabricator.wikimedia.org/T222682)
[07:03:17] 	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 461 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:03:26] 	 (CR) jerkins-bot: [V: -1] mariadb: Provision db1128 into m3 [puppet] - https://gerrit.wikimedia.org/r/512099 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[07:06:01] 	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 41 probes of 422 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:06:10] 	 (PS2) Marostegui: mariadb: Provision db1128 into m3 [puppet] - https://gerrit.wikimedia.org/r/512099 (https://phabricator.wikimedia.org/T222682)
[07:07:02] 	 (CR) jerkins-bot: [V: -1] mariadb: Provision db1128 into m3 [puppet] - https://gerrit.wikimedia.org/r/512099 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[07:08:37] 	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 461 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:09:22] 	 (CR) Marostegui: [V: +2 C: +2] "This violation is a well known fact, and it requires refactoring of the misc role, which is already under discussion" [puppet] - https://gerrit.wikimedia.org/r/512099 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[07:11:46] 	 !log Stop MySQL on db1117:3323 to clone db1128 T222682
[07:11:50] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:50] 	 T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[07:11:56] 	 ^ this will generate a dbproxy alert
[07:15:15] 	 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:15:22] 	 ^ expected :)
[07:15:43] 	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[07:16:27] 	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 18 probes of 422 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[07:16:59] 	 (PS1) Marostegui: db-eqiad.php: More traffic to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512102
[07:17:10] 	 Operations, Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (hashar) > @Volans wrote: > I'm assuming that in cases in which the rebase fails because of conflicts or the CI fails after the rebase Jenkins would vote -1 and the patch would b...
[07:18:03] 	 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:18:18] 	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy
[07:20:27] 	 (PS2) Marostegui: db-eqiad.php: More traffic to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512102
[07:23:25] 	 (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512102 (owner: Marostegui)
[07:24:26] 	 (Merged) jenkins-bot: db-eqiad.php: More traffic to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512102 (owner: Marostegui)
[07:26:36] 	 (CR) jenkins-bot: db-eqiad.php: More traffic to db1136 [mediawiki-config] - https://gerrit.wikimedia.org/r/512102 (owner: Marostegui)
[07:26:54] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1136 (duration: 00m 53s)
[07:26:57] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:27] 	 !log rebooting swift frontends in eqiad
[07:33:30] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:34:20] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:34:23] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:26] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:29] 	 RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:36:55] 	 RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:48:42] 	 (CR) Mobrovac: [EventBus] Add eventgate-main event service. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: Ppchelko)
[07:50:14] 	 Operations, Performance-Team, serviceops, HHVM, User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Marostegui) I am removing the DBA tag for now here (but remain subscribed), as we are following up on {T22401...
[07:51:45] 	 (PS1) Marostegui: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512106 (https://phabricator.wikimedia.org/T224017)
[07:56:34] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512106 (https://phabricator.wikimedia.org/T224017) (owner: Marostegui)
[07:57:34] 	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512106 (https://phabricator.wikimedia.org/T224017) (owner: Marostegui)
[07:57:48] 	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512106 (https://phabricator.wikimedia.org/T224017) (owner: Marostegui)
[07:58:46] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1080 (duration: 00m 56s)
[07:58:50] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:25] 	 (CR) Arturo Borrero Gonzalez: [C: +1] firewall loggin: enable firewall logging on wmcs servers [puppet] - https://gerrit.wikimedia.org/r/511701 (https://phabricator.wikimedia.org/T116011) (owner: Jbond)
[08:15:47] 	 (CR) Arturo Borrero Gonzalez: [C: +1] "I can confirm we don't use this in the current version of Toolforge." [puppet] - https://gerrit.wikimedia.org/r/511791 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[08:25:34] 	 (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512109
[08:26:24] 	 !log rebooting scb servers in codfw
[08:26:27] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:37] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:26:39] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:41] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:26:44] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:45] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512109 (owner: Marostegui)
[08:27:25] 	 good morning
[08:27:50] 	 (CR) Giuseppe Lavagetto: [C: -1] "I thought a bit about this, and I changed my mind on where this should belong. Sorry that I gave you a different feedback on IRC last nigh" [puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh)
[08:27:53] 	 (Merged) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512109 (owner: Marostegui)
[08:28:07] 	 (CR) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512109 (owner: Marostegui)
[08:29:03] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080 (duration: 00m 55s)
[08:29:06] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:09] 	 !log Upgrade MySQL and kernel on db1080
[08:29:11] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:49] 	 (PS2) Ppchelko: [EventBus] Add eventgate-main event service. [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822)
[08:34:03] 	 (PS2) Hashar: Re-apply "group1 wikis to 1.34.0-wmf.6" [mediawiki-config] - https://gerrit.wikimedia.org/r/511953 (https://phabricator.wikimedia.org/T224116) (owner: Jforrester)
[08:34:35] 	 (CR) Hashar: "Amended to also revert 9791a9fd1c92f03f4a3cd7e3f1869f7c57f103aa which was for cawikinews." [mediawiki-config] - https://gerrit.wikimedia.org/r/511953 (https://phabricator.wikimedia.org/T224116) (owner: Jforrester)
[08:36:55] 	 marostegui: are you busy with databases right now? I could use half an hour or so to deploy mediawiki in prod again :)
[08:40:06] 	 (CR) Gehel: [C: -1] "see comments inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[08:41:05] 	 Operations, PHP 7.2 support, Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (Joe) @Krinkle very interesting data about the resourceloader performance, I think it's no coincidence. More on that below.  Coming to APCu: it's a very different beast than h...
[08:42:07] 	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512114
[08:42:15] 	 (CR) Mobrovac: [C: +1] [EventBus] Add eventgate-main event service. [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: Ppchelko)
[08:42:56] 	 Operations, PHP 7.2 support, Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (Joe) Looking better at the last graph you pasted, I don't really explain myself why the p95 went down so much when we reintroduced php7 at 10% with respect to before we reena...
[08:46:07] 	 (CR) Hashar: [C: +2] Re-apply "group1 wikis to 1.34.0-wmf.6" [mediawiki-config] - https://gerrit.wikimedia.org/r/511953 (https://phabricator.wikimedia.org/T224116) (owner: Jforrester)
[08:46:11] 	 assuming it is fine
[08:47:08] 	 (Merged) jenkins-bot: Re-apply "group1 wikis to 1.34.0-wmf.6" [mediawiki-config] - https://gerrit.wikimedia.org/r/511953 (https://phabricator.wikimedia.org/T224116) (owner: Jforrester)
[08:47:22] 	 (CR) jenkins-bot: Re-apply "group1 wikis to 1.34.0-wmf.6" [mediawiki-config] - https://gerrit.wikimedia.org/r/511953 (https://phabricator.wikimedia.org/T224116) (owner: Jforrester)
[08:49:33] 	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Re apply group1 wikis to 1.34.0-wmf.6 T220731
[08:50:37] 	 hashar@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:50:37] 	 T220731: 1.34.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T220731
[08:55:30] 	 (PS1) Marostegui: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512115
[08:56:20] 	 Database.php: No atomic section is open (got User::addToDatabase). :D
[08:56:26] 	 those error messages are more cryptic than ever
[08:56:44] 	 marostegui: I have finished upgrading group 1 wikis to 1.34.0-wmf.6  so mediawiki-config is free for deployment ;)
[08:59:02] 	 (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/512114 (owner: Marostegui)
[08:59:28] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512115 (owner: Marostegui)
[08:59:48] 	 hashar: thanks :)
[09:00:42] 	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512115 (owner: Marostegui)
[09:01:19] 	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512115 (owner: Marostegui)
[09:01:59] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1080 (duration: 00m 55s)
[09:02:03] 	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[09:02:04] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:03] 	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[09:09:14] <_joe_>	 uhm around the time of the deploy?
[09:09:23] <_joe_>	 yes
[09:10:03] 	 !log rebooting scb servers in eqiad
[09:10:06] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:11] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:10:14] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:14] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:10:18] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:01] 	 (PS1) Giuseppe Lavagetto: profile::mediawiki::php: make apc size configurable, bump for appservers [puppet] - https://gerrit.wikimedia.org/r/512118 (https://phabricator.wikimedia.org/T223180)
[09:21:19] 	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[09:22:06] 	 that's me ^
[09:22:30] 	 !log bounce rsyslog on lithium - listener stuck /T199406
[09:22:34] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:36] 	 sigh
[09:22:41] 	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 884 days) https://wikitech.wikimedia.org/wiki/Logs
[09:23:01] 	 ah no it actually worked, stashbot updated the task
[09:25:40] 	 Operations, User-fgiunchedi: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (fgiunchedi)
[09:32:00] 	 (CR) Filippo Giunchedi: [C: +2] Set spares for ms-be[12]01[345] [puppet] - https://gerrit.wikimedia.org/r/510819 (https://phabricator.wikimedia.org/T220590) (owner: Filippo Giunchedi)
[09:32:08] 	 (PS2) Giuseppe Lavagetto: profile::mediawiki::php: make apc size configurable, bump for appservers [puppet] - https://gerrit.wikimedia.org/r/512118 (https://phabricator.wikimedia.org/T223180)
[09:32:10] 	 (PS3) Filippo Giunchedi: Set spares for ms-be[12]01[345] [puppet] - https://gerrit.wikimedia.org/r/510819 (https://phabricator.wikimedia.org/T220590)
[09:33:27] 	 (PS1) Marostegui: db-eqiad.php: Repool db1080 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/512123
[09:34:49] 	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[09:35:21] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1080 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/512123 (owner: Marostegui)
[09:35:22] 	 Operations, DNS, Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (Maintenance_bot)
[09:35:51] 	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[09:36:32] 	 (Merged) jenkins-bot: db-eqiad.php: Repool db1080 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/512123 (owner: Marostegui)
[09:36:47] 	 (CR) jenkins-bot: db-eqiad.php: Repool db1080 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/512123 (owner: Marostegui)
[09:37:28] <_joe_>	 why do cassandra alerts link to resolved tickets that have little to do with them?
[09:37:39] <_joe_>	 also godog I suppose that's not you right?
[09:37:48] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080 into API (duration: 00m 55s)
[09:37:51] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:08] 	 _joe_: modules/cassandra/manifests/instance/monitoring.pp:        notes_url     => 'https://phabricator.wikimedia.org/T93886',
[09:38:10] 	 _joe_: no I think that's expired downtimes for the decoms
[09:38:17] 	 (PS3) Giuseppe Lavagetto: profile::mediawiki::php: make apc size configurable, bump for appservers [puppet] - https://gerrit.wikimedia.org/r/512118 (https://phabricator.wikimedia.org/T223180)
[09:38:19] <_joe_>	 godog: ok thanks
[09:38:27] <_joe_>	 marostegui: I know the how
[09:38:31] <_joe_>	 I'm asking the why :D
[09:39:11] <_joe_>	 so this server is out of the cassandra cluster but still not migrated to role::spare::server
[09:39:43] 	 _joe_: the commit's comment doesn't say why :(
[09:46:02] 	 Operations, decommission, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi)
[09:46:11] 	 (CR) Jbond: [C: +1] "LGTM, one minor comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[09:49:30] 	 Operations, decommission, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) a:RobH Task updated with the checklist, hosts are now marked as spare in puppet and I've set netbox status to decommissioning, moving to...
[09:51:45] 	 !log hashar@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/Collection: Rename wfAjaxCollectionGetItemList() T224093 (duration: 00m 57s)
[09:51:49] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:49] 	 T224093: [Collections] Fatal error: Call to undefined function wfAjaxCollectionGetItemList() - https://phabricator.wikimedia.org/T224093
[09:51:50] 	 Operations, ops-codfw, decommission, media-storage, and 2 others: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (fgiunchedi)
[09:52:18] 	 Operations, ops-codfw, decommission, media-storage, and 2 others: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (fgiunchedi) a:RobH Task updated with the checklist, hosts are now marked as spare in puppet and I've set netbox status to decommissioning, moving to @RobH
[09:54:19] 	 _joe_: godog: sorry, that me, forgot to silence icinga
[09:54:30] 	 again...
[09:55:25] <_joe_>	 mobrovac: np
[09:56:12] 	 RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:56:34] 	 (CR) Filippo Giunchedi: [C: +1] role: remove prometheus backwards-compatibility rules [puppet] - https://gerrit.wikimedia.org/r/511734 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[09:56:47] 	 sigh, i wanted to set downtime in icinga, and i got "not authorised"
[09:56:52] 	 :(
[09:57:29] 	 ill try acking
[09:57:43] 	 I can ACK for you if you cannot do that
[09:57:45] 	 Or downtime
[09:58:27] 	 can't ack either
[09:58:36] 	 it seems something messed with my config
[09:58:57] 	 Maybe you didn't log in as Mobrovac and used mobrovac instead?
[09:58:59] 	 i used to be able to do that, and i also used to receive alerts by email, but none of that is happening anymore :(
[09:59:06] 	 you want me to ACK or downtime?
[09:59:09] 	 no no i used Mobrovac
[09:59:13] 	 yes please mar
[09:59:16] 	 marostegui: ^
[09:59:18] 	 ok!
[09:59:22] 	 gracais!
[09:59:31] 	 digh, can't even type today apparently
[09:59:33] 	 lol
[09:59:49] 	 XDDD
[09:59:51] 	 ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused Marostegui acked on behalf of mobrovac https://phabricator.wikimedia.org/T93886
[09:59:51] 	 ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Marostegui acked on behalf of mobrovac https://phabricator.wikimedia.org/T120662
[10:00:09] 	 thnx marostegui
[10:00:25] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:00:28] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:29] 	 !log rebooting remaining mw servers in codfw (sans mcrouter proxies for now)
[10:00:32] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:45] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:00:48] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:01] 	 Operations, ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (fgiunchedi) Stalled→Resolved a:fgiunchedi I'm resolving this since we're going to decom this host in {T220907}, thanks @Cmjohnson !
[10:03:54] 	 Operations, decommission, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) Also a note re: ms-be1013, it had its raid failed in {T220907} and currently I wasn't able to make it boot again. Not worth spending more ti...
[10:08:04] 	 (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php: make apc size configurable, bump for appservers [puppet] - https://gerrit.wikimedia.org/r/512118 (https://phabricator.wikimedia.org/T223180) (owner: Giuseppe Lavagetto)
[10:10:55] 	 (PS1) Marostegui: db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512130
[10:13:53] 	 (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512130 (owner: Marostegui)
[10:14:52] 	 (Merged) jenkins-bot: db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512130 (owner: Marostegui)
[10:15:32] <_joe_>	 !log restarted php7.2-fpm on mw1261 to assess the effect of a larger APCu shm size T223180
[10:15:36] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:37] 	 T223180: Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180
[10:16:04] 	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1080 (duration: 00m 57s)
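Repooling in stages (slowly, then into API, then fully) is done by raising db1080's load weight in db-eqiad.php. The sketch below assumes replicas are picked at random in proportion to their weights and shows how a weight change maps to a share of read traffic. The peer host names and weight values are hypothetical, not the real db-eqiad.php contents.

```python
import random
from typing import Dict


def traffic_share(loads: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of reads per replica when choice is proportional to weight."""
    total = sum(loads.values())
    if total <= 0:
        raise ValueError("at least one replica needs a positive weight")
    return {host: weight / total for host, weight in loads.items()}


def pick_replica(loads: Dict[str, int], rng: random.Random) -> str:
    """Pick one replica at random, weighted by its load value; weight 0 means depooled."""
    hosts = [host for host, weight in loads.items() if weight > 0]
    if not hosts:
        raise ValueError("at least one replica needs a positive weight")
    return rng.choices(hosts, weights=[loads[h] for h in hosts], k=1)[0]


# Hypothetical ramp: db1080 comes back at a low weight, then is raised step by step.
ramp = [
    {"db1080": 0, "replica-a": 200, "replica-b": 200},
    {"db1080": 50, "replica-a": 200, "replica-b": 200},
    {"db1080": 200, "replica-a": 200, "replica-b": 200},
]
for step in ramp:
    print(f"db1080 share: {traffic_share(step)['db1080']:.0%}")  # 0%, 11%, 33%
```

Ramping the weight lets a freshly repooled host warm its buffer pool under a small share of traffic before it takes a full share.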
[10:16:07] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:48] 	 (CR) jenkins-bot: db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/512130 (owner: Marostegui)
[10:19:08] 	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:20:50] 	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:24:02] 	 not sure about these alerts ^ onimisionipe perhaps ?
[10:24:09] 	 also gehel ^
[10:24:26] 	 looking
[10:27:15] 	 dcausse: ^
[10:28:08] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:28:11] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:19] 	 (PS1) Filippo Giunchedi: graphite: fix dashboard links for thumbnail alerts [puppet] - https://gerrit.wikimedia.org/r/512133
[10:28:28] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:28:31] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:48] 	 volunteers for rubberstamping https://gerrit.wikimedia.org/r/c/operations/puppet/+/512133 welcome
[10:30:04] 	 Amir1: (Dis)respected human, time to deploy Deploy Entity Schema to testwikidatawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1030). Please do the needful.
[10:30:12] 	 o/ 
[10:30:37] 	 Can I deploy? or should I stop because of cirrus search?
[10:36:49] 	 (PS2) Ladsgroup: deploy WikibaseSchema to test [mediawiki-config] - https://gerrit.wikimedia.org/r/511844 (https://phabricator.wikimedia.org/T216956)
[10:38:18] 	 (CR) Ladsgroup: [C: +2] deploy WikibaseSchema to test [mediawiki-config] - https://gerrit.wikimedia.org/r/511844 (https://phabricator.wikimedia.org/T216956) (owner: Ladsgroup)
[10:39:22] 	 (Merged) jenkins-bot: deploy WikibaseSchema to test [mediawiki-config] - https://gerrit.wikimedia.org/r/511844 (https://phabricator.wikimedia.org/T216956) (owner: Ladsgroup)
[10:39:37] 	 (CR) jenkins-bot: deploy WikibaseSchema to test [mediawiki-config] - https://gerrit.wikimedia.org/r/511844 (https://phabricator.wikimedia.org/T216956) (owner: Ladsgroup)
[10:40:06] 	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:40:36] 	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[10:43:48] 	 Operations, Discovery-Search (Current work): Cleanup puppet hieradata for logstash - https://phabricator.wikimedia.org/T224074 (Maintenance_bot)
[10:44:10] 	 Operations, PHP 7.2 support, Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (Maintenance_bot)
[10:44:21] 	 sorry, I was out, checking those alerts
[10:44:34] 	 !log ladsgroup@mwmaint1002:/srv/mediawiki/php-1.34.0-wmf.5$ mwscript sql.php --wiki=testwikidatawiki extensions/EntitySchema/sql/EntitySchema.sql (T216956)
[10:44:38] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:39] 	 T216956: deploy EntitySchema to test - https://phabricator.wikimedia.org/T216956
[10:44:48] 	 yep, can ignore at the moment, looks like a too aggressive threshold on a new check
[10:45:08] 	 Operations, ops-codfw, decommission, media-storage, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Maintenance_bot)
[10:45:30] 	 Operations, decommission, media-storage, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (Maintenance_bot)
[10:47:10] 	 Operations, observability, Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (Maintenance_bot)
[10:47:25] 	 looks good, going live
[10:48:02] 	 Operations, Traffic: cp3031: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (Maintenance_bot)
[10:48:35] 	 Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (Maintenance_bot)
[10:50:02] 	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:511844|deploy WikibaseSchema to test (T216956)]] (duration: 00m 56s)
[10:50:05] 	 !log ladsgroup@mwmaint1002:/srv/mediawiki/php-1.34.0-wmf.5$ mwscript sql.php --wiki=wikidatawiki extensions/EntitySchema/sql/EntitySchema.sql (T216955)
[10:50:06] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:06] 	 T216956: deploy EntitySchema to test - https://phabricator.wikimedia.org/T216956
[10:50:10] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:10] 	 T216955: deploy EntitySchema to production - https://phabricator.wikimedia.org/T216955
[10:50:32] 	 Operations: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (Maintenance_bot)
[10:50:47] 	 Operations, ops-eqiad: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (Maintenance_bot)
[10:50:59] 	 Operations, SRE-Access-Requests, Release-Engineering-Team (Kanban), User-Urbanecm, User-greg: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (zeljkofilipin) 👍 from me.
[10:51:04] 	 Operations, ops-codfw, Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Maintenance_bot)
[10:51:11] 	 actually, looks like there might have been a real issue with cirrus updates
[10:51:16] 	 Operations, SRE-Access-Requests, Release-Engineering-Team (Kanban), User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (zeljkofilipin)
[10:51:18] 	 Operations, ops-eqiad: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (Maintenance_bot)
[10:51:27] 	 nothing alarming (yet), but needs more investigation
[10:51:34] 	 Operations, fundraising-tech-ops: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (Maintenance_bot)
[10:51:37] 	 Operations, PHP 7.2 support, Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (Joe) After doubling the cache size on one server, I noticed the cache-hit ratio plateaued between 80% and 90% after ~ 150 MB of space were occupied. I'll let it grow more, bu...
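The hit ratio quoted above is hits / (hits + misses). APCu's counters are cumulative since the last php-fpm restart (such as the one on mw1261 earlier), so a ratio over a recent window says more than the lifetime figure, which still includes the cold-start misses. The sketch below compares two snapshots of those counters. The `num_hits` and `num_misses` keys are assumed to mirror the fields reported by PHP's `apcu_cache_info()`, and the snapshot values are hypothetical.

```python
from typing import Mapping


def interval_hit_ratio(before: Mapping[str, int], after: Mapping[str, int]) -> float:
    """Hit ratio over the interval between two snapshots of cumulative counters."""
    hits = after["num_hits"] - before["num_hits"]
    misses = after["num_misses"] - before["num_misses"]
    if hits < 0 or misses < 0:
        # Counters went backwards: the cache (e.g. php-fpm) was restarted in between.
        raise ValueError("counters were reset between snapshots")
    lookups = hits + misses
    if lookups == 0:
        raise ValueError("no lookups in interval")
    return hits / lookups


before = {"num_hits": 1000, "num_misses": 500}  # hypothetical snapshot
after = {"num_hits": 1850, "num_misses": 650}   # taken some minutes later
print(f"{interval_hit_ratio(before, after):.0%}")  # 85%
```

On these made-up numbers the lifetime ratio is 1850 / 2500 = 74%, while the most recent interval runs at 85%. That gap is why the trend, not the lifetime total, shows whether a bigger cache helps.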
[10:51:51] 	 Operations, Icinga, observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (Maintenance_bot)
[10:51:57] 	 !log Deploying EntitySchema to testwikidatawiki is done
[10:52:00] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:07] 	 Operations, Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (Maintenance_bot)
[10:59:51] 	 Operations, Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (Volans) @hashar another question for you. If I have 2 CRs, chained one on top of another and I +2 both of them because I want to deploy them together, and the first one fails bu...
[11:00:05] 	 MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1100).
[11:00:05] 	 No GERRIT patches in the queue for this window AFAICS.
[11:10:25] 	 (CR) Muehlenhoff: [C: +1] admin: add jfishback to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/511897 (https://phabricator.wikimedia.org/T222910) (owner: Volans)
[11:14:11] 	 Operations, Performance-Team, serviceops, HHVM, User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Marostegui) The slow query that is being investigated at T224017 is only reported as slow when `'rev_page'= '...
[11:14:15] 	 (PS2) Volans: admin: add jfishback to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/511897 (https://phabricator.wikimedia.org/T222910)
[11:16:56] 	 (CR) Volans: [C: +2] admin: add jfishback to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/511897 (https://phabricator.wikimedia.org/T222910) (owner: Volans)
[11:21:25] 	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[11:21:25] 	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:21:28] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:41] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:51] 	 (PS1) Jbond: striker: add example documentation [puppet] - https://gerrit.wikimedia.org/r/512136
[11:23:35] 	 !log rebooting auth1002 for kernel update
[11:23:38] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:05] 	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[11:34:05] 	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:34:08] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:12] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:58] 	 (PS3) Jbond: wmcs openstack: remove redundant hiera config [puppet] - https://gerrit.wikimedia.org/r/511769
[11:36:37] 	 (Abandoned) Jbond: wmcs openstack: re-add striker::uwsgi::secret_config config [puppet] - https://gerrit.wikimedia.org/r/511771 (owner: Jbond)
[11:37:05] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:37:08] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:27] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:37:30] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:28] 	 PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
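Puppet refuses to apply a catalog whose resource ordering loops back on itself (A before B before ... before A). The generic depth-first cycle finder below illustrates what such a cycle looks like. The resource names are made up, and this is not how Puppet itself detects or reports cycles.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic.

    deps maps each resource to the resources it must be applied after.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(deps):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


example = {"Service[a]": ["File[b]"], "File[b]": ["Package[c]"], "Package[c]": ["Service[a]"]}
print(" -> ".join(find_cycle(example) or []))
```

Breaking any one edge of the reported loop (one `require` or `before`, in Puppet terms) makes the graph orderable again.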
[11:50:49] 	 !log will shortly start rolling reboots of thumbor servers
[11:50:52] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:21] 	 (PS3) Michael Große: Add a list of IDs to skip in production [mediawiki-config] - https://gerrit.wikimedia.org/r/511753
[12:00:05] 	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1200)
[12:04:56] 	 !log powercycling mw2268 (stuck after reboot)
[12:04:58] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:17] 	 Operations, ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (MoritzMuehlenhoff) What is this blocked on, it's not obvious from the task description?
[12:15:58] 	 RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[12:22:17] 	 Operations, Puppet, Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 (MoritzMuehlenhoff) @herron This task is done, is elnath still used for anything? Otherwise can you decom the Ganeti VM?
[12:31:56] 	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/511769 (owner: Jbond)
[12:34:08] 	 PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:35:51] 	 (PS49) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[12:36:40] 	 Hi apergos, I have another question for you
[12:36:50] 	 apergos: if now is the time, obviously :)
[12:36:59] 	 perhaps we'll get lucky and I'll have another answer for you :-)
[12:37:03] 	 (CR) Mathew.onipe: icinga: create and apply cirrus config check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[12:38:48] 	 apergos: I think the `contributor.id` field has changed when null in 2019-04 dump version - Can you confirm?
[12:39:12] 	 can you give me a specific example?
[12:39:30] 	 I can !
[12:39:31] 	 the schema has not changed so without further information I'm not sure what difference you are seeing
[12:40:59] 	 apergos: https://gist.github.com/jobar/2b3d0e6b3d24fff5ad5909c70328aac4
[12:41:27] 	 apergos: What I see are empty fields () while they were not here last month AFAIK
[12:41:30] 	 all right, I'll look at what the previous version had (unless you have that handy)
[12:42:22] 	 phab paste is a good place for snippets too (dunno if we need a ticket for this but we might)
[12:42:23] 	 apergos: I can find that
[12:44:07] 	 Actually apergos - problem is seen on 2019-05 files - currently checking on 2019-04
[12:46:59] 	 great, thanks
[12:47:36] 	 (PS50) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[12:52:06] 	 Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (fgiunchedi) >>! In T219544#5206319, @Ottomata wrote: > Alright, I've written a bash wrapper to help out with this.  I'd do it with just the swif...
[12:53:00] 	 apergos: Counting the number of "< id/>" string -- in 20190501/zhwikisource: 10311 -- in 20190401/zhwikisource: 0
[12:53:14] 	 apergos: my grep is still running to get the rev-id in previous dump
[12:53:24] 	 ok
[12:53:32] 	 thanks for taking the time
[12:53:41] 	 apergos: correction of the string: ""
[12:53:51] 	 what's in the stubs?
[12:53:59] 	 eh nm I'll do the work and find out. just if the 
[12:54:16] 	 weird string is there too, you can get a grep done much sooner by grepping from a zcat of the stubs
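Stub dumps omit revision text, so they are much quicker to scan than full-history files. Instead of grepping for the exact empty-element string, a streaming parse can count contributors whose `id` child is empty. This sketch assumes the standard MediaWiki XML export layout (page > revision > contributor > id) and strips the export namespace from tag names. The function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import BinaryIO, Tuple, Union


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that export elements carry."""
    return tag.rsplit("}", 1)[-1]


def count_contributor_ids(source: Union[str, BinaryIO]) -> Tuple[int, int]:
    """Stream a gzipped XML dump; return (contributors seen, contributors with an empty <id>)."""
    total = empty = 0
    with gzip.open(source, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            name = local_name(elem.tag)
            if name == "contributor":
                total += 1
                for child in elem:
                    # Only the contributor's own <id>; page and revision ids are ignored.
                    if local_name(child.tag) == "id" and not (child.text or "").strip():
                        empty += 1
            elif name == "page":
                elem.clear()  # release finished pages to limit memory use
    return total, empty
```

Anonymous edits carry `<ip>` instead of `<id>`, so they are counted as contributors but never as empty ids. Only a present-but-blank `<id>` is counted.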
[12:54:18] 	 apergos: I tried bowiki and it didn't show any empty id
[12:54:20] 	 anyhoo
[12:54:28] 	 of apergos 
[12:54:32] 	 of course apergosd
[12:54:52] 	 ok gtk
[12:55:04] 	 I think I'm going to ask that this get turned into a ticket, do you mind?
[12:55:11] 	 because I'll have some other info to add to it
[12:55:16] 	 apergos: in previous dump value was 0
[12:55:20] 	 and then probably have to poke someone for a mw core fix
[12:55:26] 	 np apergos 
[12:55:30] 	 Creating a ticket :)
[12:55:33] 	 thanks!
[12:55:40] 	 just assign to me straight off
[12:55:43] 	 Sure
[13:00:00] 	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:00:05] 	 hashar: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1300).
[13:00:26] 	 !log swift eqiad-prod: ms-be1033 weight to 1500 - T223518
[13:00:30] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:30] 	 oh the train already? is the european one running this week?
[13:00:31] 	 T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518
[13:04:29] 	 ah yeah train
[13:04:36] 	 apergos: yeah that is me!
[13:04:49] 	 woo hoo!
[13:04:58] 	 apergos: should I hold it a bit?
[13:05:03] 	 not for me
[13:07:34] 	 not sure if this is a services or an ops problem but this request never returns: http://en.wikipedia.org/api/rest_v1/data/citation/mediawiki-basefields/https%3A%2F%2Fwww.babycentre.co.uk%2Fc5112%2Fbefore-you-begin
[13:07:55] 	 it should be at least timing out and returning a response...
[13:08:01] 	 it used to anyway.
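For a request that hangs like the citoid URL above, a client-side timeout at least turns "pending forever" into a definite result, much as the Icinga HTTP checks report "Socket timeout after 10 seconds". The probe below is a minimal sketch. Its 10-second default mirrors those checks, and it deliberately bypasses any `*_proxy` environment settings. It only diagnoses from the outside and does not fix whichever backend is failing to time out.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> str:
    """GET url and classify the outcome instead of waiting indefinitely."""
    # Connect directly, ignoring *_proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"OK: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # 4xx/5xx: the stack answered, just not with success.
        return f"ERROR: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Connection-level failures (refused, DNS, timeout while sending) land here.
        if isinstance(exc.reason, socket.timeout):
            return f"TIMEOUT: no response within {timeout} seconds"
        return f"ERROR: {exc.reason}"
    except socket.timeout:
        # Raised directly when the server accepts the connection but never answers.
        return f"TIMEOUT: no response within {timeout} seconds"
```

`HTTPError` is caught before `URLError` because it is a subclass of it. The bare `socket.timeout` branch covers a response that never arrives, which urllib does not wrap in `URLError`.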
[13:08:03] 	 https://phabricator.wikimedia.org/T220731#5207828  you know of this already, it's the only thing I'm aware of
[13:09:53] 	 apergos: https://phabricator.wikimedia.org/T224221
[13:09:54] 	 Thanks :)
[13:10:14] 	 thanks for reporting!
[13:10:18] 	 mvolz:  i can reproduce :)
[13:10:41] 	 I imagine if it is in core (most probably) then a fix won't make it out to all the groups until the next run (June 1)
[13:15:29] <_joe_>	 mvolz: have you opened a ticket?
[13:15:36] <_joe_>	 that's... citoid
[13:15:44] 	 _joe_: no, and yup :).
[13:16:10] <_joe_>	 mvolz: if you could open a task, it would be easier for some of us to take a look
[13:16:14] 	 ok going to hmm do the train
[13:16:23] 	 choo choo
[13:16:36] <_joe_>	 we're out several people and I'm not sure I have the bandwidth to look into it right now
[13:17:17] 	 (PS1) Hashar: all wikis to 1.34.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/512165
[13:17:19] 	 (CR) Hashar: [C: +2] all wikis to 1.34.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/512165 (owner: Hashar)
[13:17:30] <_joe_>	 mvolz: unless it's not a regression and citoid is completely broken, but that's not my understanding
[13:17:31] 	 apergos: well we can cherry pick fixes to the wmf deployment branch
[13:17:53] <_joe_>	 mvolz: also add any relevant logs, IIRC you have access to logstash right
[13:18:13] <_joe_>	 this has nothing to do with core
[13:18:21] <_joe_>	 that url is served by restbase
[13:18:31] 	 (Merged) jenkins-bot: all wikis to 1.34.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/512165 (owner: Hashar)
[13:18:39] <_joe_>	 or you're talking about the dumps issue?
[13:18:39] 	 oh my 
[13:18:48] 	 (CR) jenkins-bot: all wikis to 1.34.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/512165 (owner: Hashar)
[13:19:23] <_joe_>	 that is related to core indeed, and a cherry-pick could help there, probably, once the fix is ready
[13:19:30] 	 Operations, Citoid, Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (Mvolz)
[13:19:53] 	 so yeah for dump we can cherry pick stuff
[13:20:18] 	 then Stephane Bisson mentioned "Invalid message parameter CommentStoreComment from EventPresentationModel.php"  https://phabricator.wikimedia.org/T223741
[13:20:23] 	 though that predates the current wmf train
[13:20:29] 	 not worried about dumps
[13:20:54] 	 and I guess I missed it
[13:21:09] <_joe_>	 so the citoid issue has so many layer of the onion to peel it will take some time
[13:21:30] <_joe_>	 varnish => restbase => citoid => zotero => proxy
[13:21:41] <_joe_>	 *layers
[13:22:27] 	 the commentstore issue is the one I was mentioning, in relation to deployments, and that's more of an FYI or 'decide what you want to do'
[13:23:33] 	 _joe_: we need an enterprise service bus to let those services talk to each other
[13:23:35] 	 Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Ottomata) Great thanks!  > While we're at it I recommend creating the container with the lowlatency storage policy so that swift will allocate o...
[13:23:55] 	 https://upload.wikimedia.org/wikipedia/commons/a/a2/ESB.svg (yellow is the bus)
[13:23:56] * _joe_ puts a target on hashar's back
[13:24:02] 	 I am half kidding hehe
[13:24:07] <_joe_>	 I hope so
[13:24:26] <_joe_>	 you know what that yellow thing is called, in computer science?
[13:24:34] 	 I thought EventBus would be our homegrown reinvented enterprise bus, but apparently it is for something else
[13:24:34] <_joe_>	 a big fat single point of failure
[13:24:49] <_joe_>	 yeah when I heard the term I was immediately triggered
[13:25:10] 	 RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational
[13:25:17] <_joe_>	 although, it is an event bus. Just not a message bus, where services use it to talk to each other
[13:25:18] 	 well you can make the bus resilient :]
[13:25:20] 	 PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:25:28] 	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.6
[13:25:28] 	 PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:25:30] <_joe_>	 looking
[13:25:31] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:36] 	 PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:25:43] 	 most probably due to the deployment
[13:26:07] <_joe_>	 uh
[13:26:13] <_joe_>	 you're doing a deployment?
[13:26:15] 	 though I don't remember it alarming previously
[13:26:16] 	 lol
[13:26:27] 	 _joe_:  hes doing the train
[13:26:35] 	 Fatal error: entire web request took longer than 60 seconds and timed out 
[13:26:36] 	 RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 75395 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:26:38] 	 that is how this whole discussion started...
[13:26:44] 	 RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:26:47] 	 that's the spike of errors we get every time we do a deployment :/
[13:26:50] 	 RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:27:15] 	 so that if the Icinga probe happens to hit Apache/HHVM at the time we deploy, the probe times out
[13:27:38] <_joe_>	 this is not due to the cdb rebuild btw
[13:29:45] 	 we had a task for that spike of timeouts, but I can't find it anymore :(
[13:30:48] 	 Operations, ops-eqiad, DC-Ops, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[13:30:49] 	 AH https://phabricator.wikimedia.org/T204871
[13:31:31] 	 Operations, ops-eqiad, DC-Ops, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[13:33:21] 	 Operations, ops-eqiad, DC-Ops, Traffic, decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[13:35:06] 	 (PS1) BBlack: lvs1001-6: remove prod cfg for spare reimage [puppet] - https://gerrit.wikimedia.org/r/512169 (https://phabricator.wikimedia.org/T224223)
[13:36:15] 	 hashar: re T223741, if the train is done now, I would like to deploy the hotfix to limit the number of EchoEvents created with this issue in the db. Is it a good time to do it?
[13:36:16] 	 T223741: Invalid message parameter CommentStoreComment from EventPresentationModel.php - https://phabricator.wikimedia.org/T223741
[13:37:03] 	 stephanebisson: yeah I am fine doing it right now
[13:37:35] 	 hashar: OK, you do it or should I?
[13:37:50] 	 stephanebisson: please do :)
[13:38:09] 	 I'm on it
[13:38:29] 	 this way I can focus on the logspam now that 1.34.0-wmf.6 is on all wikis
[13:38:40] 	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2286 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:38:46] 	 PROBLEM - Check systemd state on mw2286 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:38:48] 	 I +1'd the change as a token of approval
[13:39:54] 	 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10BBlack)
[13:41:04] 	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational
[13:41:51] 	 !log stopped pybal on lvs1001-6 - T224223
[13:41:55] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:56] 	 T224223: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223
[13:46:52] 	 (03PS51) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[13:47:33] 	 stephanebisson: please poke me if anything is needed
[13:47:58] 	 hashar: sure, thanks
[13:52:13] 	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans)
[13:55:09] 	 !log decommissioning restbase1009-b -- T223976
[13:55:13] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:14] 	 T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[13:56:51] 	 !log sbisson@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/Echo: SWAT: [[gerrit:512070|Don't add CommentStoreComment as plaintext params]] (duration: 00m 50s)
[13:56:54] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:13] 	 hashar: deployment done
[13:59:59] 	 stephanebisson: thank you
[14:01:21] 	 (03CR) 10BBlack: [C: 03+2] lvs1001-6: remove prod cfg for spare reimage [puppet] - 10https://gerrit.wikimedia.org/r/512169 (https://phabricator.wikimedia.org/T224223) (owner: 10BBlack)
[14:02:25] 	 (03CR) 10Gehel: [C: 04-1] icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[14:03:48] 	 10Operations, 10Citoid, 10Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10Joe) I tested the actual backend request to citoid  ` curl 'citoid.discovery.wmnet:1970/api?format=mediawiki-basefields&search=https%3A%2F%2Fwww.babycentre.co...
[14:05:06] 	 (03PS1) 10BBlack: eqiad low-traffic: re-order LVS array [puppet] - 10https://gerrit.wikimedia.org/r/512175
[14:11:27] 	 stephanebisson: apparently you are familiar with the GrowthExperiments extension. It seems its help panel might be broken on https://ko.wikipedia.org/ https://phabricator.wikimedia.org/T224224
[14:11:56] 	 well, at least ResourceLoader raises an exception when creating the RL module "ext.growthExperiments.Help"
[14:12:21] 	 hashar: ouch, thanks for the ping, looking into it
[14:13:41] 	 (03CR) 10BBlack: [C: 03+2] eqiad low-traffic: re-order LVS array [puppet] - 10https://gerrit.wikimedia.org/r/512175 (owner: 10BBlack)
[14:13:52] 	 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[14:17:23] 	 [{exception_id}] {exception_url} ParseError from line 11 of /srv/mediawiki/php-1.34.0-wmf.6/includes/TemplateParser.php(149) : eval()'d code: syntax error, unexpected '=>' (T_DOUBLE_ARROW), expecting ')' 
[14:17:25] 	 do
[14:17:28] 	 10Operations, 10Citoid, 10Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10Joe) Correcting myself:  in most cases, citoid times out after 120 seconds returning an empty response. Sometimes, it returns a 404.
[14:21:24] 	 !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet
[14:21:27] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:37] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:21:37] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:21:40] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:43] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:03] 	 10Operations, 10Citoid, 10Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10Pchelolo) Couple of notes:  1. It's broken in beta on citoid as well: `curl 'https://citoid-beta.wmflabs.org/api?format=mediawiki&search=https%3A%2F%2Fwww.bab...
[14:25:59] 	 !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1001.eqiad.wmnet
[14:26:02] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:12] 	 !log reboot thumbor1002
[14:28:15] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:57] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:29:57] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:30:00] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:03] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:40] 	 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10fgiunchedi) >>! In T219544#5207981, @Ottomata wrote: > Great thanks! >  >> While we're at it I recommend creating the container with the lowlate...
[14:30:43] 	 (03CR) 10CDanis: [C: 03+1] "one dangling nitpicky thought" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511720 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:31:28] 	 (03CR) 10CDanis: [C: 03+1] icinga: fix location of meta-monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/511721 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:34:43] 	 10Operations, 10Citoid, 10Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10Joe) I caught an error in production:  `lang=json {   "name": "citoid",   "hostname": "citoid-production-76db86989b-8td54",   "pid": 16,   "level": 40,   "err...
[14:34:54] 	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational
[14:36:48] 	 !log reboot thumbor1003
[14:36:51] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:10] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:37:10] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:37:13] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:16] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:00] 	 (03CR) 10Herron: "The mgmt addresses 10.193.1.18 and 10.193.1.19 are responding to ping currently, while the others are not.  Just want to check that this i" [dns] - 10https://gerrit.wikimedia.org/r/512069 (owner: 10Papaul)
[14:41:00] 	 (03CR) 10Herron: [C: 03+2] " herron: yes only.18 and .19 are setup on the server for now" [dns] - 10https://gerrit.wikimedia.org/r/512069 (owner: 10Papaul)
[14:41:04] 	 (03PS2) 10Herron: DNS: Add mgmt and production DNS for kafka-main200[1-5] [dns] - 10https://gerrit.wikimedia.org/r/512069 (owner: 10Papaul)
[14:41:44] 	 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Anomie) >>! In T187147#5207128, @tstarling wrote: > * To provide backtraces for catchable fatals. I don't know why this is hard to...
[14:42:39] 	 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org',...
[14:43:49] 	 !log reboot thumbor1004
[14:43:52] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:43:53] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:53] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:43:56] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:58] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:03] 	 10Operations, 10Citoid, 10Services: Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10Joe) Anyways, this is clearly a citoid issue, as it can be reproduced in beta as well.
[14:44:50] 	 (03PS52) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[14:45:04] 	 (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[14:45:51] 	 (03CR) 10jerkins-bot: [V: 04-1] icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[14:46:06] 	 :/
[14:46:33] 	 (03CR) 10Ori.livneh: "Giuseppe, can I ask you to reconsider?" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh)
[14:48:10] 	 (03CR) 10CDanis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:48:19] 	 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10aborrero) I wonder if puppet `subscribe =>` and/or `notify =>` mechanisms would work in this case.  We could link the etcd service to the certificate file,...
[14:48:32] 	 (03PS53) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[14:50:31] 	 !log reboot thumbor2001
[14:50:34] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:19] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:51:19] 	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[14:51:22] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:25] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:32] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:51:32] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:51:35] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:38] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:16] 	 10Operations, 10Citoid, 10RESTBase-API, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Some citoid requests aren't timing out and are pending indefinitely - https://phabricator.wikimedia.org/T224222 (10mobrovac)
[14:52:19] 	 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (10Joe) I'm rolling back the size of APCu to 512M after seeing how, on mw1261, the occupied memory grew significantly but the cache hit ratio didn't. IMHO it only risks to creat...
[14:53:28] <_joe_>	 hashar: where can I find it on logstash?
[14:55:13] 	 (03CR) 10Volans: icinga: fix meta-monitoring sync script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511720 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:55:21] 	 _joe_: it?
[14:55:35] <_joe_>	 the exceptions you were reporting earlier
[14:56:13] 	 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10BBlack) A few thoughts:  * None of our CI on ops/puppet provides very strong guarantees of correctness regardless of what we do here.  It's possible for the automatic CI to succ...
[14:56:33] 	 _joe_: for exceptions, we usually fill in the request ID, so in the Logstash search field one can just search for it: reqId:xXXXzzzAAAy
[14:56:56] <_joe_>	 yeah, can you make an example?
[14:57:21] 	 !log reboot thumbor2002
[14:57:25] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:33] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[14:57:34] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:57:37] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:40] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:05] 	 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10BBlack) One more:  * Automatic merges are not fool-proof in general.  This is clear in the separate-file case (e.g. changing two separate manifests in two seemingly-unrelated mo...
[15:01:09] 	 (03CR) 10Gehel: [C: 04-1] "Looks like this works as expected. Some style issues and we should be good to merge." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[15:02:22] 	 !log reboot thumbor2003
[15:02:25] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[15:02:25] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:02:26] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:29] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:32] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:56] 	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul)
[15:05:40] 	 10Operations, 10Release Pipeline, 10Services, 10serviceops, and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac)
[15:07:37] 	 (03PS2) 10Bstorm: wiki replicas: Improve index usage for queries against revision_userindex [puppet] - 10https://gerrit.wikimedia.org/r/511910 (https://phabricator.wikimedia.org/T221339) (owner: 10Anomie)
[15:09:27] 	 (03CR) 10Bstorm: "Yup!  That did it.  This is the resulting definition:" [puppet] - 10https://gerrit.wikimedia.org/r/511910 (https://phabricator.wikimedia.org/T221339) (owner: 10Anomie)
[15:10:17] 	 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10Volans) @RobH @faidon @crusnov: I've made the changes to the Lifecycle page, please have a look: https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&di...
[15:10:30] 	 (03PS1) 10Ottomata: Add swift analytics_admin dummy account key [labs/private] - 10https://gerrit.wikimedia.org/r/512183 (https://phabricator.wikimedia.org/T219544)
[15:11:20] 	 (03PS3) 10Jforrester: FlaggedRevisions: Copy in rest of the config, for static registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512053 (owner: 10Reedy)
[15:13:50] 	 _joe_: sorry, I missed your reply. Which task/exception are you looking for? ;)
[15:14:00] 	 (03PS3) 10Jforrester: Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy)
[15:14:11] <_joe_>	 hashar: nvmd, found it :D
[15:14:52] 	 hashar: re T224224 This is really bad for us and the kowiki community. Any chance I can push a config change ASAP?
[15:14:52] 	 T224224: [GrowthExperiments] Sessions are disabled for this entry point - https://phabricator.wikimedia.org/T224224
[15:15:49] 	 stephanebisson: yeah we can either hotfix it, or just rollback
[15:15:59] <_joe_>	 stephanebisson: if it's a config change, sure. If it unbreaks the site, SRE can help too
[15:16:00] 	 I have no idea why it would only affect kowiki, though
[15:16:23] 	 (03PS1) 10Ottomata: Add Swift analytics account with analytics:admin user [puppet] - 10https://gerrit.wikimedia.org/r/512184 (https://phabricator.wikimedia.org/T219544)
[15:17:52] 	 hashar: Details on the task. We have a plan to bypass the consequences while investigating the root cause (something in Core?).
[15:20:11] 	 (03PS1) 10Sbisson: Hardcode korean help desk config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512186 (https://phabricator.wikimedia.org/T224224)
[15:20:50] 	 hashar: this is the config change: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/512186
[15:21:06] 	 ;]
[15:21:49] 	 hashar, _joe_: do I have your blessing to deploy it now?
[15:22:05] 	 (03CR) 10Hashar: [C: 03+2] "Yes lets hotfix it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512186 (https://phabricator.wikimedia.org/T224224) (owner: 10Sbisson)
[15:22:08] 	 stephanebisson: blessed :]
[15:22:18] 	 as for the root cause I have no idea really :-
[15:22:20] 	 (
[15:23:01] 	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add swift analytics_admin dummy account key [labs/private] - 10https://gerrit.wikimedia.org/r/512183 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[15:23:05] 	 (03Merged) 10jenkins-bot: Hardcode korean help desk config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512186 (https://phabricator.wikimedia.org/T224224) (owner: 10Sbisson)
[15:23:15] 	 hashar: can you sync it?
[15:23:23] 	 (03CR) 10jenkins-bot: Hardcode korean help desk config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512186 (https://phabricator.wikimedia.org/T224224) (owner: 10Sbisson)
[15:23:31] 	 (03CR) 10Ottomata: [C: 03+2] Add Swift analytics account with analytics:admin user [puppet] - 10https://gerrit.wikimedia.org/r/512184 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[15:23:42] 	 stephanebisson: yes doing it 
[15:23:45] 	 hashar: note that we can test on a debug server
[15:24:17] 	 stephanebisson: it is on mwdebug1001 now
[15:24:23] 	 testing..
[15:24:28] 	 <3
[15:24:48] 	 one thing I love when running the train is talking to others and feeling more or less helpful ;]
[15:24:52] 	 small wins!
[15:26:34] 	 hashar: LGTM
[15:27:50] 	 (03PS2) 1020after4: phab: Configure sapis for all php extensions [puppet] - 10https://gerrit.wikimedia.org/r/512082 (https://phabricator.wikimedia.org/T151070)
[15:30:26] 	 stephanebisson: deploying!
[15:31:01] 	 stephanebisson: and we will have to change the value in June :D
[15:31:10] 	 !log reboot thumbor2004
[15:31:14] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:23] 	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[15:31:24] 	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:31:26] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:29] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:34] 	 hashar: if we still haven't figured it out... we will!
[15:31:37] 	 !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Hardcode korean help desk config - T224224 (duration: 00m 48s)
[15:31:42] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:42] 	 T224224: [GrowthExperiments] Sessions are disabled for this entry point - https://phabricator.wikimedia.org/T224224
[15:33:18] 	 (03CR) 1020after4: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/16735/phab1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/512082 (https://phabricator.wikimedia.org/T151070) (owner: 1020after4)
[15:33:28] 	 !log rolling restart of swift-proxy to apply creation of analytics_admin account
[15:33:31] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:03] 	 I am done for today! The train seems all right :]
[15:42:43] 	 10Operations, 10netops, 10Patch-For-Review: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10jbond) just watching ripe presentation and thought this may be of interest https://ripe78.ripe.net/archives/video/106
[15:47:48] 	 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1001.wikimedia.org', 'lvs1004.wikimedia.org', 'lvs1006.wikimedia.org', 'lvs1002.wikimedia.org',...
[15:50:10] 	 (03CR) 10Anomie: [C: 03+1] "+1 to the changes in PS2." [puppet] - 10https://gerrit.wikimedia.org/r/511910 (https://phabricator.wikimedia.org/T221339) (owner: 10Anomie)
[15:53:17] 	 10Operations, 10MediaWiki-extensions-PdfHandler, 10Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10Schtom) ubtuntu 16.04 imagemagick: Version: ImageMagick 6.8.9-9 Q16 x86_64 2018-09-28 ghostscript:...
[15:53:33] 	 (03CR) 10Dzahn: [C: 03+2] phab: Configure sapis for all php extensions [puppet] - 10https://gerrit.wikimedia.org/r/512082 (https://phabricator.wikimedia.org/T151070) (owner: 1020after4)
[15:56:40] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:56:41] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:56:43] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:46] 	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[15:56:46] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:47] 	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:56:50] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:53] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:08] 	 !log rebooting furud/flerovium for kernel updates
[15:57:11] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] 	 godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1600).
[16:00:04] 	 No GERRIT patches in the queue for this window AFAICS.
[16:06:12] 	 elukey_: after a long time... it looks finally fixed (phab worker process memory leaks) https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&panelId=7&fullscreen&from=now-30d&to=now
[16:13:38] 	 !log restarting phd on phab1003 to pick up new php module config 
[16:13:41] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:48] 	 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn)
[16:15:31] 	 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) Phabricator has been switched to phab1003 as the prod server now and that meant:  - php 5 to...
[16:16:31] 	 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) I am thinking now we could make the process easier and just keep phab1003 as the prod server...
[16:17:25] 	 (03CR) 10Ottomata: "Hm!" [puppet] - 10https://gerrit.wikimedia.org/r/511690 (owner: 10CDanis)
[16:18:09] 	 paladox: https://phabricator.wikimedia.org/T125357#5208481
[16:18:20] 	 yup, loads now!
[16:18:28] 	 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10mmodell) @dzahn: agreed. I don't know who should decide if we keep phab1001 or return it. We've pro...
[16:19:50] 	 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) I will bring this up in my next subteam discussion meeting which should be in a week. Until...
[16:20:23] 	 paladox: :)
[16:20:33] 	 worked for me too: cc: twentyafterfour 
[16:21:28] 	 paladox: also see the grafana link above :)
[16:21:39] 	 the one twentyafterfour links to?
[16:21:41] 	 no more leaking finally
[16:21:46] 	 https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&panelId=7&fullscreen&from=now-30d&to=now
[16:22:07] 	 should we update that graph for it to use phab1003?
[16:25:06] 	 actually https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&from=now-7d&to=now needs to be updated to use phab1003 (all of that)
[16:25:09] 	 cc mutante twentyafterfour ^^
[16:27:01] 	 paladox: uh... yes
[16:27:05] 	 (03CR) 10BBlack: [C: 03+1] "Looks useful! Probably shouldn't mess with X-Cache, as lots of things would be perturbed by the change in parsing of it." [puppet] - 10https://gerrit.wikimedia.org/r/511690 (owner: 10CDanis)
[16:27:51] 	 paladox: not even sure if it's just the label or we see the wrong data 
[16:28:19] 	 but i gotta run for a moment. will be back later
[16:28:43] 	 ok
[16:29:33] 	 PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:30:51] 	 I think only the graph needs updating (the exported JSON shows a hardcoded phab1001)
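A minimal sketch of the fix discussed above, assuming the dashboard is updated by exporting its JSON, rewriting the host name, and re-importing it. The helper name `replace_host` and the metric path `servers.phab1001.memory.used` are hypothetical placeholders, not the real queries on the Phabricator dashboard.

```python
import json


def replace_host(node, old, new):
    """Return a copy of a dashboard JSON structure in which every string
    value has `old` replaced by `new`. Dict keys are left untouched, and
    the input structure is not modified."""
    if isinstance(node, dict):
        return {k: replace_host(v, old, new) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_host(v, old, new) for v in node]
    if isinstance(node, str):
        return node.replace(old, new)
    return node  # numbers, booleans, None pass through unchanged


if __name__ == "__main__":
    # Hypothetical excerpt of an exported dashboard; real panels differ.
    exported = json.loads("""{
      "title": "Phabricator",
      "panels": [
        {"id": 7, "targets": [{"target": "servers.phab1001.memory.used"}]}
      ]
    }""")
    updated = replace_host(exported, "phab1001", "phab1003")
    print(json.dumps(updated, indent=2))
```

A blind substring replace is the simplest option, but it would also rewrite names such as `phab10010`, so the diff should be reviewed before re-importing.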
[16:31:17] 	 10Operations, 10Analytics, 10Traffic: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis)
[16:32:02] 	 10Operations, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (10Anomie) I have no opinion. I don't think the existing 50MB limit was chosen for any specific reason.
[16:33:49] 	 (03PS2) 10CDanis: varnishnsca webrequest: log Server: in response as 'backend' [puppet] - 10https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236)
[16:35:13] 	 PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:38:28] 	 (03PS7) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507)
[16:39:01] 	 (03CR) 10jerkins-bot: [V: 04-1] Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov)
[16:41:00] 	 (03CR) 10Anomie: [C: 03+1] "> Therefore, I may experiment with doing that on this patch." [puppet] - 10https://gerrit.wikimedia.org/r/510595 (https://phabricator.wikimedia.org/T223406) (owner: 10Anomie)
[16:42:40] 	 (03PS8) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507)
[16:43:50] 	 (03PS9) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507)
[16:44:23] 	 (03CR) 10jerkins-bot: [V: 04-1] Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov)
[16:44:28] 	 (03PS1) 10BBlack: lvs1001-6: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/512189
[16:45:15] 	 (03PS54) 10Mathew.onipe: icinga: create and apply cirrus settings check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[16:45:40] 	 (03CR) 10Mathew.onipe: icinga: create and apply cirrus settings check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[16:46:05] 	 (03CR) 10jerkins-bot: [V: 04-1] icinga: create and apply cirrus settings check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[16:47:00] 	 (03PS10) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507)
[16:47:37] 	 (03PS2) 10BBlack: lvs1001-6: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/512189 (https://phabricator.wikimedia.org/T224223)
[16:47:39] 	 (03PS1) 10BBlack: lvs1001-6: remove jessie-installer settings [puppet] - 10https://gerrit.wikimedia.org/r/512190 (https://phabricator.wikimedia.org/T224223)
[16:48:11] 	 (03PS55) 10Mathew.onipe: icinga: create and apply cirrus settings check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[16:50:16] 	 (03CR) 10BBlack: [C: 03+2] lvs1001-6: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/512189 (https://phabricator.wikimedia.org/T224223) (owner: 10BBlack)
[16:50:19] 	 (03CR) 10BBlack: [C: 03+2] lvs1001-6: remove jessie-installer settings [puppet] - 10https://gerrit.wikimedia.org/r/512190 (https://phabricator.wikimedia.org/T224223) (owner: 10BBlack)
[16:50:37] 	 (03CR) 10CRusnov: [C: 03+1] "merge away" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/511969 (owner: 10Ayounsi)
[16:51:31] 	 10Operations, 10Discovery-Search (Current work): Cleanup puppet hieradata for logstash - https://phabricator.wikimedia.org/T224074 (10Mathew.onipe) 05Open→03Resolved
[17:00:04] 	 cscott, arlolra, subbu, and halfak: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1700).
[17:01:24] 	 no parsoid deploy today
[17:02:01] 	 Operations, Cloud-Services, Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (Bstorm) But puppet doesn't run the agent like normal when it modifies a cert.  It waits for signature, etc.?   However, now that you mention it, I think I...
[17:03:47] 	 (PS3) Andrew Bogott: nova: make all services active/active [puppet] - https://gerrit.wikimedia.org/r/511950 (https://phabricator.wikimedia.org/T223905)
[17:03:49] 	 (PS2) Andrew Bogott: designate: make designate nodes active/active [puppet] - https://gerrit.wikimedia.org/r/512081 (https://phabricator.wikimedia.org/T223905)
[17:03:51] 	 (PS1) Andrew Bogott: neutron: make the neutron api server ('neutron-server') active/active [puppet] - https://gerrit.wikimedia.org/r/512192 (https://phabricator.wikimedia.org/T223905)
[17:08:16] 	 (PS3) Bstorm: wiki replicas: Improve index usage for queries against revision_userindex [puppet] - https://gerrit.wikimedia.org/r/511910 (https://phabricator.wikimedia.org/T221339) (owner: Anomie)
[17:09:53] 	 (PS4) Andrew Bogott: nova: make all services active/active [puppet] - https://gerrit.wikimedia.org/r/511950 (https://phabricator.wikimedia.org/T223905)
[17:09:55] 	 (PS3) Andrew Bogott: designate: make designate nodes active/active [puppet] - https://gerrit.wikimedia.org/r/512081 (https://phabricator.wikimedia.org/T223905)
[17:09:57] 	 (PS2) Andrew Bogott: neutron: make the neutron api server ('neutron-server') active/active [puppet] - https://gerrit.wikimedia.org/r/512192 (https://phabricator.wikimedia.org/T223905)
[17:10:37] 	 (CR) Bstorm: [C: +2] wiki replicas: Improve index usage for queries against revision_userindex [puppet] - https://gerrit.wikimedia.org/r/511910 (https://phabricator.wikimedia.org/T221339) (owner: Anomie)
[17:12:20] 	 (PS1) Cwhite: logstash: add deprecated-input tag to deprecated inputs [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103)
[17:16:54] 	 (PS1) DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/512195
[17:16:56] 	 (PS1) DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/512196
[17:17:38] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair)
[17:18:19] 	 RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational
[17:24:09] 	 RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational
[17:34:13] 	 Operations, DC-Ops, netops, observability: Send some LibreNMS alerts to dcops and netops only - https://phabricator.wikimedia.org/T224180 (Krenair)
[17:34:16] 	 (CR) Herron: "Nice! Question for you inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: Cwhite)
[17:41:36] 	 (PS1) Ottomata: Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544)
[17:42:29] 	 (CR) jerkins-bot: [V: -1] Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[17:47:17] 	 (PS2) Ottomata: Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544)
[17:53:50] 	 (CR) Ottomata: [C: +2] Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[17:53:58] 	 (PS3) Ottomata: Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544)
[17:54:02] 	 (CR) Ottomata: [V: +2 C: +2] Install python3-swiftclient on analytics cluster nodes [puppet] - https://gerrit.wikimedia.org/r/512203 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[17:57:33] 	 (PS2) Cwhite: logstash: add deprecated-input tag to deprecated inputs [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103)
[17:58:44] 	 (CR) Cwhite: logstash: add deprecated-input tag to deprecated inputs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: Cwhite)
[18:00:04] 	 MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1800).
[18:00:04] 	 ottomata: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:06:11] 	 Operations, ops-eqiad, DC-Ops, Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ayounsi)
[18:10:37] 	 (CR) Ayounsi: [C: +2] Format README, remove mention to oldhardware.py [software/netbox-reports] - https://gerrit.wikimedia.org/r/511969 (owner: Ayounsi)
[18:13:16] 	 is SWAT happening? and if so, can i add a no-op patch?
[18:13:38] 	 (PS4) Bartosz Dziewoński: Simplify VisualEditor config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511016 (https://phabricator.wikimedia.org/T223793)
[18:13:47] 	 (PS4) Bartosz Dziewoński: Simplify VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511029 (https://phabricator.wikimedia.org/T223793)
[18:24:38] 	 well, i added it to the SWAT
[18:24:49] 	 i would be happy to learn that it is actually going to happen
[18:40:34] 	 (PS1) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[18:41:26] 	 (CR) jerkins-bot: [V: -1] Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[18:42:37] 	 heh i put a no-op on there too 
[18:49:52] 	 (PS2) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[18:54:32] 	 (PS1) CRusnov: profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212
[18:54:34] 	 !log decommissioning restbase1009-c -- T223976
[18:54:39] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:39] 	 T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[18:55:05] 	 (CR) jerkins-bot: [V: -1] profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212 (owner: CRusnov)
[18:55:28] 	 (CR) Ottomata: "Looks good in https://puppet-compiler.wmflabs.org/compiler1002/16740/an-master1001.eqiad.wmnet/change.an-master1001.eqiad.wmnet.pson" [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[18:59:54] 	 (PS2) CRusnov: profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212
[19:00:04] 	 Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T1900)
[19:00:38] 	 (PS1) CRusnov: profile::librenms: Add fake secret for dbpassword [labs/private] - https://gerrit.wikimedia.org/r/512216
[19:04:21] 	 (PS3) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[19:05:49] 	 (PS1) Acamicamacaraca: Enable VisualEditor [mediawiki-config] - https://gerrit.wikimedia.org/r/512220
[19:06:36] 	 (PS2) Acamicamacaraca: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/512220
[19:12:59] 	 (PS3) Acamicamacaraca: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/512220
[19:14:47] 	 (PS4) Acamicamacaraca: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024)
[19:19:26] 	 (PS3) Cwhite: logstash: add deprecated-input tag to deprecated inputs [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103)
[19:42:37] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:42:47] 	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:44:11] 	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:44:59] 	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:46:47] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:49:27] 	 (PS4) Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - https://gerrit.wikimedia.org/r/511751
[19:51:30] 	 (CR) Ori.livneh: "Patch set 4:" [puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh)
[19:52:01] 	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:52:35] 	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:52:35] 	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:53:47] 	 (PS5) Andrew Bogott: nova: make all services active/active [puppet] - https://gerrit.wikimedia.org/r/511950 (https://phabricator.wikimedia.org/T223905)
[19:55:42] 	 (CR) Andrew Bogott: [C: +2] nova: make all services active/active [puppet] - https://gerrit.wikimedia.org/r/511950 (https://phabricator.wikimedia.org/T223905) (owner: Andrew Bogott)
[19:58:56] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Andrew) >>! In T223902#5203523, @Vgutierrez wrote: > so, after a quick check you should consider several things: > * wikimedia.org is a can...
[20:02:37] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair) >>! In T223902#5209071, @Andrew wrote: >>>! In T223902#5203523, @Vgutierrez wrote: >> so, after a quick check you should consider...
[20:08:48] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Vgutierrez) Right.. that ldap service certificate it's being handled by acme-chief and as Alex explained the *.wikimedia.org limitation onl...
[20:12:16] 	 Operations, serviceops, PHP 7.2 support, Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (bd808)
[20:12:37] 	 Operations, serviceops, wikitech.wikimedia.org, PHP 7.2 support, Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (bd808)
[20:21:05] 	 (PS4) Andrew Bogott: designate: make designate nodes active/active [puppet] - https://gerrit.wikimedia.org/r/512081 (https://phabricator.wikimedia.org/T223905)
[20:24:30] 	 (CR) Andrew Bogott: [C: +2] designate: make designate nodes active/active [puppet] - https://gerrit.wikimedia.org/r/512081 (https://phabricator.wikimedia.org/T223905) (owner: Andrew Bogott)
[20:27:56] 	 (CR) BryanDavis: [C: +1] "LGTM. The linter would yell about some whitespace nits in the example, but humans should be able to read it. :)" [puppet] - https://gerrit.wikimedia.org/r/512136 (owner: Jbond)
[20:29:21] 	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[20:31:24] 	 (CR) Ayounsi: [C: +1] profile::librenms: Add fake secret for dbpassword [labs/private] - https://gerrit.wikimedia.org/r/512216 (owner: CRusnov)
[20:32:22] 	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (Papaul)
[20:38:01] 	 (PS1) Dzahn: webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234
[20:38:23] 	 (PS3) CRusnov: profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212
[20:38:33] 	 (CR) jerkins-bot: [V: -1] webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234 (owner: Dzahn)
[20:39:21] 	 (PS1) EBernhardson: Convert cirrus data retention from cron to systemd [puppet] - https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200)
[20:39:26] 	 (PS2) Dzahn: webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234
[20:39:54] 	 (CR) jerkins-bot: [V: -1] Convert cirrus data retention from cron to systemd [puppet] - https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: EBernhardson)
[20:40:17] 	 (CR) jerkins-bot: [V: -1] webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234 (owner: Dzahn)
[20:41:24] 	 (PS1) Papaul: DHCP: Add MAC address for kafka-main2001 [puppet] - https://gerrit.wikimedia.org/r/512236 (https://phabricator.wikimedia.org/T223492)
[20:41:51] 	 (CR) Ayounsi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/512212 (owner: CRusnov)
[20:43:30] 	 (PS3) Andrew Bogott: neutron: make the neutron api server ('neutron-server') active/active [puppet] - https://gerrit.wikimedia.org/r/512192 (https://phabricator.wikimedia.org/T223905)
[20:45:01] 	 (CR) Dzahn: [C: +2] DHCP: Add MAC address for kafka-main2001 [puppet] - https://gerrit.wikimedia.org/r/512236 (https://phabricator.wikimedia.org/T223492) (owner: Papaul)
[20:45:14] 	 (CR) CRusnov: [V: +2 C: +2] profile::librenms: Add fake secret for dbpassword [labs/private] - https://gerrit.wikimedia.org/r/512216 (owner: CRusnov)
[20:46:47] 	 (PS4) Andrew Bogott: neutron: make the neutron api server ('neutron-server') active/active [puppet] - https://gerrit.wikimedia.org/r/512192 (https://phabricator.wikimedia.org/T223905)
[20:47:30] 	 (PS3) Dzahn: webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234
[20:47:53] 	 (CR) Andrew Bogott: [C: +2] neutron: make the neutron api server ('neutron-server') active/active [puppet] - https://gerrit.wikimedia.org/r/512192 (https://phabricator.wikimedia.org/T223905) (owner: Andrew Bogott)
[20:49:54] 	 Operations, serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (Dzahn)
[20:50:01] 	 Operations, serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (Dzahn) a:Dzahn
[20:50:14] 	 Operations, serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (Dzahn) p:Triage→Normal
[20:52:12] 	 (PS2) EBernhardson: Convert cirrus data retention from cron to systemd [puppet] - https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200)
[20:54:03] 	 Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (Papaul) oops, wrong ticket for the Gerrit comment
[20:54:14] 	 Operations, ops-codfw, Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (Papaul) Change 512236 had a related patch set uploaded (by Papaul; owner: Papaul): [operations/puppet@production] DHCP: Add MAC address for kafka-main2001  https://gerrit.wikime...
[20:55:47] 	 (CR) EBernhardson: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/16746/" [puppet] - https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: EBernhardson)
[21:11:54] 	 Operations, serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (Dzahn)
[21:12:33] 	 (PS4) Dzahn: webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234 (https://phabricator.wikimedia.org/T224194)
[21:15:45] 	 Operations, ops-codfw, Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (Papaul) @herron can't find a partman recipe   ────────────────────────┤ [!!] Partition disks ├─────────────────────────┐      │...
[21:16:08] 	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/16748/krypton.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/512234 (https://phabricator.wikimedia.org/T224194) (owner: Dzahn)
[21:16:55] 	 (PS5) Dzahn: webserver_misc_apps: move httpd setup to a profile [puppet] - https://gerrit.wikimedia.org/r/512234 (https://phabricator.wikimedia.org/T224194)
[21:23:15] 	 Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi) p:Triage→Normal
[21:23:48] 	 Operations, ops-codfw, Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (herron) ok, no worries I'll poke at this for a bit and try to get it installed
[21:31:49] 	 (CR) Herron: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: Cwhite)
[21:33:32] 	 (PS3) Dzahn: webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194)
[21:33:41] 	 chaomodus: do you have grafana admin rights or should we all have the same privileges to edit dashboards? just curious
[21:33:58] 	 i know grafana-admin is not a thing anymore
[21:34:00] 	 idk i can generally edit boards though
[21:34:05] 	 (CR) jerkins-bot: [V: -1] webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194) (owner: Dzahn)
[21:34:14] 	 hm, ok. thanks for changing the one for phabricator
[21:35:00] 	 Operations, ops-eqiad, DC-Ops, Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs1001.wikimedia.org', 'lvs1002.wikimedia.org',...
[21:35:35] 	 no worries ;)
[21:37:45] 	 https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&from=now-2d&to=now&panelId=7&fullscreen looks much much more healthier then https://phab.wmfusercontent.org/file/data/ndufvobxbczsdup2yjkl/PHID-FILE-hj7p5fug3ekmspr3acct/Screenshot_from_2019-05-23_11-05-11.png
[21:38:26] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Andrew) I am fine with changing our proposed names to things like keystone-eqiad1.wikimedia.org or keystone-eqiad1-wmcs.wikimedia.org if th...
[21:38:51] 	 (PS4) Dzahn: webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194)
[21:39:23] 	 (CR) jerkins-bot: [V: -1] webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194) (owner: Dzahn)
[21:40:41] 	 (PS5) Dzahn: webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194)
[21:41:57] 	 Operations, Phabricator, serviceops, Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Paladox) Updated image for phab1003:  {F29226685}  It looks much much more hea...
[21:44:57] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (Krenair) Another thing to consider if we're really talking about using the prod caches is that currently those endpoints are not exposed to...
[21:52:02] 	 (CR) Dzahn: [C: +2] webserver_misc_apps: add PHP7.2 APT repository on stretch [puppet] - https://gerrit.wikimedia.org/r/512066 (https://phabricator.wikimedia.org/T224194) (owner: Dzahn)
[21:53:59] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:54:31] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:54:43] 	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[21:54:49] 	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:54:55] 	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[21:58:24] 	 Operations, Analytics, Traffic, Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (Volans) p:Triage→Normal
[21:58:47] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:59:09] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[22:01:16] 	 (PS1) Dzahn: phabricator: stop paging SRE for process checks, keep for https [puppet] - https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205)
[22:01:48] 	 (CR) jerkins-bot: [V: -1] phabricator: stop paging SRE for process checks, keep for https [puppet] - https://gerrit.wikimedia.org/r/512290 (https://phabricator.wikimedia.org/T224205) (owner: Dzahn)
[22:03:15] 	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[22:03:31] 	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[22:04:35] 	 (PS1) Dzahn: nagios_common: update phabricator contact group members [puppet] - https://gerrit.wikimedia.org/r/512291 (https://phabricator.wikimedia.org/T224205)
[22:05:13] 	 Operations, ops-eqiad, DC-Ops, Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1004.wikimedia.org', 'lvs1002.wikimedia.org', 'lvs1005.wikimedia.org', 'lvs1001.wikimedia.org',...
[22:05:32] 	 (CR) Dzahn: "anyone else who should be paged for phab things?" [puppet] - https://gerrit.wikimedia.org/r/512291 (https://phabricator.wikimedia.org/T224205) (owner: Dzahn)
[22:08:00] 	 Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (BBlack) Do these belong in `wikimedia.org` at all?  It seems this has already been discussed, but I guess I lack some context.  The comment...
[22:09:00] 	 I'll do some config deploys that didn't get done during SWAT.
[22:09:10] 	 (CR) Jbond: [C: +2] striker: add example documentation [puppet] - https://gerrit.wikimedia.org/r/512136 (owner: Jbond)
[22:09:16] 	 (CR) Jforrester: [C: +2] Simplify VisualEditor config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511016 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:09:59] 	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[22:10:21] 	 (Merged) jenkins-bot: Simplify VisualEditor config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511016 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:10:29] 	 (PS1) Dzahn: nagios_common: update members of the gerrit contact group [puppet] - https://gerrit.wikimedia.org/r/512292
[22:10:45] 	 (CR) jenkins-bot: Simplify VisualEditor config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511016 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:11:10] 	 (PS2) Jbond: striker: add example documentation [puppet] - https://gerrit.wikimedia.org/r/512136
[22:12:37] 	 (CR) CRusnov: [C: +2] profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212 (owner: CRusnov)
[22:13:06] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T223793 Add wmgVisualEditorIsSecondaryEditor to InitialiseSettings (duration: 00m 49s)
[22:13:11] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:11] 	 T223793: On non-SET wikis (two edit tabs), links to new pages (red links) should open the user's preferred editor (last used) - https://phabricator.wikimedia.org/T223793
[22:13:15] 	 (CR) Jforrester: [C: +2] Simplify VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511029 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:14:14] 	 (Merged) jenkins-bot: Simplify VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511029 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:14:24] 	 (PS4) CRusnov: profile::librenms: Minor refactor for parameters and type hints [puppet] - https://gerrit.wikimedia.org/r/512212
[22:15:12] 	 (CR) BryanDavis: [C: +1] wmcs openstack: remove redundant hiera config [puppet] - https://gerrit.wikimedia.org/r/511769 (owner: Jbond)
[22:16:34] 	 (CR) jenkins-bot: Simplify VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/511029 (https://phabricator.wikimedia.org/T223793) (owner: Bartosz Dziewoński)
[22:17:06] 	 (CR) Jbond: [C: +2] wmcs openstack: remove redundant hiera config [puppet] - https://gerrit.wikimedia.org/r/511769 (owner: Jbond)
[22:17:15] 	 (PS4) Jbond: wmcs openstack: remove redundant hiera config [puppet] - https://gerrit.wikimedia.org/r/511769
[22:17:51] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T223793 Read wmgVisualEditorIsSecondaryEditor in CommonSettings (duration: 00m 48s)
[22:18:02] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:38] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T223793 Drop wmgVisualEditorSingleEditTabSecondaryEditor and wmgVisualEditorSecondaryTabs from InitialiseSettings (duration: 00m 48s)
[22:19:42] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:42] 	 T223793: On non-SET wikis (two edit tabs), links to new pages (red links) should open the user's preferred editor (last used) - https://phabricator.wikimedia.org/T223793
[22:19:56] 	 (PS4) Jforrester: Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148)
[22:20:06] 	 (CR) Jforrester: [C: +2] Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester)
[22:21:07] 	 (Merged) jenkins-bot: Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester)
[22:21:21] 	 (CR) jenkins-bot: Duplicate …Squid variables into …Cdn ahead of MW renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/496847 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester)
[22:23:28] 	 !log jforrester@deploy1001 Synchronized wmf-config/reverse-proxy.php: T104148 Duplicate …Squid variables into …Cdn ahead of MW renaming, part 1 (duration: 00m 48s)
[22:23:32] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:33] 	 T104148: Change Squid references in Wikimedia configuration files - https://phabricator.wikimedia.org/T104148
[22:24:36] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T104148 Duplicate …Squid variables into …Cdn ahead of MW renaming, part 2 (duration: 00m 48s)
[22:24:39] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:44] 	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[22:25:32] 	 (PS4) Jforrester: Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/496848
[22:25:44] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T104148 Duplicate …Squid variables into …Cdn ahead of MW renaming, part 3 (duration: 00m 47s)
[22:25:54] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
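Between 22:19 and 22:25 the …Squid settings were duplicated under …Cdn names in reverse-proxy.php, InitialiseSettings.php and CommonSettings.php, ahead of the MediaWiki-side rename, so that both spellings resolve during the transition. A minimal Python sketch of that aliasing step, assuming the settings can be treated as a flat name-to-value map (the real wmf-config files are PHP) and using hypothetical variable names, since the actual ones are elided in the log:

```python
from typing import Any, Dict


def duplicate_squid_to_cdn(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `settings` where every key containing "Squid"
    also exists under the same name with "Squid" replaced by "Cdn".

    Hypothetical sketch: an explicitly set Cdn key is never overwritten.
    """
    result = dict(settings)
    for key, value in settings.items():
        if "Squid" in key:
            # setdefault keeps any Cdn value that is already present.
            result.setdefault(key.replace("Squid", "Cdn"), value)
    return result


if __name__ == "__main__":
    # Hypothetical names for illustration; the real variables are elided ("…Squid").
    config = {"wgExampleSquidMaxage": 300, "wgUnrelatedSetting": True}
    print(duplicate_squid_to_cdn(config))
    # → {'wgExampleSquidMaxage': 300, 'wgUnrelatedSetting': True, 'wgExampleCdnMaxage': 300}
```

Using `setdefault` means a Cdn value that is already set explicitly wins over the copied Squid value; the log does not say whether wmf-config resolves such conflicts the same way.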
[22:25:59] 	 (PS4) Jforrester: Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849
[22:26:45] 	 (CR) Jforrester: [C: +2] Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/496848 (owner: Jforrester)
[22:27:46] 	 (Merged) jenkins-bot: Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/496848 (owner: Jforrester)
[22:28:01] 	 (CR) jenkins-bot: Stop reading wmgUseClusterSquid, never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/496848 (owner: Jforrester)
[22:29:05] 	 (CR) Jforrester: [C: +2] Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849 (owner: Jforrester)
[22:29:43] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop reading wmgUseClusterSquid, never varied (duration: 00m 47s)
[22:29:50] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:04] 	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 251.51 ms
[22:30:08] 	 (Merged) jenkins-bot: Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849 (owner: Jforrester)
[22:30:23] 	 (CR) jenkins-bot: Stop setting wmgUseClusterSquid, never varied, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/496849 (owner: Jforrester)
[22:31:22] 	 (PS1) CRusnov: profile::librenms: Add dummy irc password. [labs/private] - https://gerrit.wikimedia.org/r/512293
[22:32:12] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wmgUseClusterSquid, never varied, no longer used (duration: 00m 48s)
[22:32:15] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:47] 	 (PS6) Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - https://gerrit.wikimedia.org/r/501003
[22:33:55] 	 Operations, ops-eqiad, DC-Ops, Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack)
[22:34:51] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - https://gerrit.wikimedia.org/r/501003 (owner: Jforrester)
[22:35:33] 	 (PS1) CRusnov: profile::librenms: Fix irc password [puppet] - https://gerrit.wikimedia.org/r/512294
[22:35:53] 	 (Merged) jenkins-bot: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - https://gerrit.wikimedia.org/r/501003 (owner: Jforrester)
[22:36:07] 	 (CR) jerkins-bot: [V: -1] profile::librenms: Fix irc password [puppet] - https://gerrit.wikimedia.org/r/512294 (owner: CRusnov)
[22:36:14] 	 Operations, ops-eqiad, DC-Ops, Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (BBlack) a:BBlack→ayounsi These are reimaged to `role(spare::system)` now.  Over to @ayounsi for getting rid of all the special cases related to thes...
[22:36:19] 	 (CR) jenkins-bot: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - https://gerrit.wikimedia.org/r/501003 (owner: Jforrester)
[22:37:29] 	 (PS2) CRusnov: profile::librenms: Fix irc password [puppet] - https://gerrit.wikimedia.org/r/512294
[22:37:58] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup I, CommonSettings (duration: 00m 48s)
[22:38:10] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:04] 	 (CR) Ayounsi: [C: +1] profile::librenms: Fix irc password [puppet] - https://gerrit.wikimedia.org/r/512294 (owner: CRusnov)
[22:39:17] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup I, InitialiseSettings (duration: 00m 47s)
[22:39:20] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:47] 	 (PS3) Jforrester: Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - https://gerrit.wikimedia.org/r/501004
[22:39:51] 	 (CR) Ayounsi: [C: +1] profile::librenms: Add dummy irc password. [labs/private] - https://gerrit.wikimedia.org/r/512293 (owner: CRusnov)
[22:40:00] 	 (CR) CRusnov: [C: +2] profile::librenms: Fix irc password [puppet] - https://gerrit.wikimedia.org/r/512294 (owner: CRusnov)
[22:40:32] 	 (CR) CRusnov: [V: +2 C: +2] profile::librenms: Add dummy irc password. [labs/private] - https://gerrit.wikimedia.org/r/512293 (owner: CRusnov)
[22:41:12] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - https://gerrit.wikimedia.org/r/501004 (owner: Jforrester)
[22:41:28] 	 (PS4) Jforrester: Invariant config cleanup: III - SVG rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501005
[22:42:12] 	 (Merged) jenkins-bot: Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - https://gerrit.wikimedia.org/r/501004 (owner: Jforrester)
[22:42:27] 	 (CR) jenkins-bot: Invariant config cleanup: II - Account and anti-abuse settings [mediawiki-config] - https://gerrit.wikimedia.org/r/501004 (owner: Jforrester)
[22:43:54] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup II, CommonSettings (duration: 00m 48s)
[22:44:02] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:26] 	 (PS4) Jforrester: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501006
[22:44:46] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: III - SVG rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501005 (owner: Jforrester)
[22:44:51] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup II, InitialiseSettings (duration: 00m 48s)
[22:44:55] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:51] 	 (Merged) jenkins-bot: Invariant config cleanup: III - SVG rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501005 (owner: Jforrester)
[22:46:40] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501006 (owner: Jforrester)
[22:47:04] 	 (PS4) Jforrester: Invariant config cleanup: V - Notifications matters [mediawiki-config] - https://gerrit.wikimedia.org/r/501007
[22:47:17] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup III, CommonSettings (duration: 00m 48s)
[22:47:28] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:39] 	 (Merged) jenkins-bot: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501006 (owner: Jforrester)
[22:48:07] 	 (CR) jenkins-bot: Invariant config cleanup: III - SVG rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/501005 (owner: Jforrester)
[22:49:11] 	 (PS5) Jforrester: Invariant config cleanup: V - Notifications matters [mediawiki-config] - https://gerrit.wikimedia.org/r/501007
[22:50:09] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup III, InitialiseSettings (duration: 00m 47s)
[22:50:12] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:16] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup IV, CommonSettings (duration: 00m 48s)
[22:51:24] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:01] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup IV, InitialiseSettings (duration: 00m 47s)
[22:53:04] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:12] 	 (PS4) Jforrester: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501008
[22:53:30] 	 (PS4) Jforrester: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501009
[22:53:56] 	 Operations, SRE-Access-Requests, Security-Team, Patch-For-Review, User-greg: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (greg) a:greg→None
[22:54:21] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: V - Notifications matters [mediawiki-config] - https://gerrit.wikimedia.org/r/501007 (owner: Jforrester)
[22:55:31] 	 (Merged) jenkins-bot: Invariant config cleanup: V - Notifications matters [mediawiki-config] - https://gerrit.wikimedia.org/r/501007 (owner: Jforrester)
[22:56:44] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup V, CommonSettings (duration: 00m 47s)
[22:56:54] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:15] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501008 (owner: Jforrester)
[22:57:30] 	 (PS4) Jforrester: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - https://gerrit.wikimedia.org/r/501010
[22:57:39] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup V, InitialiseSettings (duration: 00m 47s)
[22:57:40] 	 (PS4) Jforrester: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/501011
[22:57:42] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:13] 	 (Merged) jenkins-bot: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501008 (owner: Jforrester)
[22:59:56] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup VI, CommonSettings (duration: 00m 48s)
[23:00:04] 	 MaxSem, RoanKattouw, and Niharika: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190523T2300).
[23:00:04] 	 ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:05] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:22] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501009 (owner: Jforrester)
[23:00:50] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup VI, InitialiseSettings (duration: 00m 47s)
[23:00:53] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:56] 	 (PS1) CRusnov: profile::netbox: Add librenms configuration for reports [puppet] - https://gerrit.wikimedia.org/r/512299
[23:01:18] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: VIII - ULS logging [mediawiki-config] - https://gerrit.wikimedia.org/r/501010 (owner: Jforrester)
[23:01:21] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: IX - RightsIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/501011 (owner: Jforrester)
[23:01:30] 	 (Merged) jenkins-bot: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - https://gerrit.wikimedia.org/r/501009 (owner: Jforrester)
[23:02:20] 	 (Merged) jenkins-bot: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - https://gerrit.wikimedia.org/r/501010 (owner: Jforrester)
[23:02:24] 	 (Merged) jenkins-bot: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/501011 (owner: Jforrester)
[23:03:45] 	 Operations, SRE-Access-Requests, Security-Team, Patch-For-Review, User-greg: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (greg) >>! In T222910#5172142, @jbond wrote: > @greg can you please approve jfishback addition...
[23:03:48] 	 (PS4) Jforrester: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/501012
[23:04:47] 	 (CR) Jforrester: [C: +2] Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/501012 (owner: Jforrester)
[23:05:45] 	 (Merged) jenkins-bot: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/501012 (owner: Jforrester)
[23:05:49] 	 Operations, Traffic: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (HMarcus)
[23:06:50] 	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Invariant config cleanup VII–X, CommonSettings (duration: 00m 47s)
[23:06:59] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:41] 	 just me, i guess i can swat
[23:07:55] 	 ebernhardson: Sorry, yes, all done.
[23:07:58] 	 Well, ish.
[23:08:01] 	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Invariant config cleanup VII–X, InitialiseSettings (duration: 00m 48s)
[23:08:04] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:05] 	 Now it's all done.
[23:08:26] 	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:08:28] 	 Operations, SRE-Access-Requests, Release-Engineering-Team (Kanban), User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (greg) Yup, +1. Thanks @Urbanecm. Now for some training with @zeljkofilipin :)
[23:09:15] 	 Operations, SRE-Access-Requests, Release-Engineering-Team (Kanban), User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (greg) a:greg→None
[23:09:17] 	 (Abandoned) Jforrester: SDC: Configure initial qualifiers for Test Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/502845 (owner: Jforrester)
[23:09:18] 	 Operations, Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (HMarcus)
[23:09:46] 	 (PS2) Jforrester: SDC: Stop setting wgMediaInfoEnableFilePageDepicts, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/507060
[23:10:09] 	 (CR) Jforrester: [C: +1] "Good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/507060 (owner: Jforrester)
[23:10:24] 	 (PS4) CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - https://gerrit.wikimedia.org/r/509445
[23:10:30] 	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:11:18] 	 (CR) jerkins-bot: [V: -1] profile::netbox: stop using icinga as remote cron [puppet] - https://gerrit.wikimedia.org/r/509445 (owner: CRusnov)
[23:12:47] 	 (PS3) Jforrester: Drop the 'inactive' user rights grant, no longer around post-DisableAccount [mediawiki-config] - https://gerrit.wikimedia.org/r/462592 (https://phabricator.wikimedia.org/T158594)
[23:13:00] 	 (CR) Jforrester: [C: +1] "Should be good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/462592 (https://phabricator.wikimedia.org/T158594) (owner: Jforrester)
[23:30:37] 	 (CR) CRusnov: "Compile looks good https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16750/" [puppet] - https://gerrit.wikimedia.org/r/512299 (owner: CRusnov)
[23:39:02] 	 (PS21) CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[23:43:02] 	 !log ebernhardson@deploy1001 Started scap: php-1.34.0-wmf.6/extensions/CirrusSearch/includes/ T223738 Consider searching out of limits an error
[23:43:06] 	 (CR) CDanis: "Thanks for the review.  Have fixed or added TODOs for almost all of your comments." (58 comments) [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[23:43:06] 	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:07] 	 T223738: PHP Fatal Error on Special:Search with certain offset query parameter - https://phabricator.wikimedia.org/T223738
[23:46:45] 	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (Volans) @Dzahn I guess @georgina would be more appropriate, the expiration contact should be related t...
[23:46:47] 	 hmm, rsync-common taking much longer than normal
[23:48:15] 	 finished, but took 4m22s
[23:49:20] 	 (CR) CDanis: Add a WMF-specific tool for managing db config in MediaWiki (2 comments) [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[23:49:32] 	 (PS22) CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[23:52:42] 	 doh, it's taking longer because i `scap sync` instead of sync-file ... oh well i can wait
[23:56:57] 	 Operations, Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (Volans) p:Triage→Normal @HMarcus yes we have the current rule in the exim configuration: ` legalquestions: legal, liaison `  But there is no mention rule that matches `liaison` or any refer...
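Volans's comment describes an exim alias rule, `legalquestions: legal, liaison`, where no rule defines `liaison`. A minimal Python sketch of how such dangling alias targets could be found, assuming a simple `name: target, target` alias format and a separately known set of real mailboxes. The function names are hypothetical, and exim's actual routers may resolve addresses through other sources that this sketch ignores:

```python
from typing import Dict, List, Set


def parse_aliases(text: str) -> Dict[str, List[str]]:
    """Parse lines of the form "name: target, target" into a dict.

    Assumed format: blank lines and lines starting with "#" are skipped.
    """
    aliases: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, targets = line.partition(":")
        aliases[name.strip()] = [t.strip() for t in targets.split(",") if t.strip()]
    return aliases


def unresolved_targets(aliases: Dict[str, List[str]], mailboxes: Set[str]) -> Set[str]:
    """Return targets that are neither defined as an alias nor a known mailbox."""
    return {
        target
        for targets in aliases.values()
        for target in targets
        if target not in aliases and target not in mailboxes
    }


if __name__ == "__main__":
    rules = "legalquestions: legal, liaison\n"
    # Assumption for illustration: "legal" is a real mailbox, "liaison" is defined nowhere.
    print(unresolved_targets(parse_aliases(rules), mailboxes={"legal"}))
    # → {'liaison'}
```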