[16:26:53] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:26:53] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:26:55] ok. here we go [16:26:57] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:26:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:27:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:27:08] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:27:11] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:28:03] puppet enabled back on rb1007 [16:28:36] Pchelolo: hm we're not done yet it seems [16:28:47] summary in eqiad still failing [16:28:55] for mcs [16:30:08] I know, I know.. [16:30:10] (03PS3) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) [16:30:36] what is going on here? 
rb1007 reports all good, but scb100* nodes don't concur [16:30:49] (03PS4) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) [16:31:19] and i haven't received a single email about any of this from icinga, this is not normal [16:31:22] (03PS3) 10Lucas Werkmeister (WMDE): dologmsg: add -h/--help option [puppet] - 10https://gerrit.wikimedia.org/r/511043 (https://phabricator.wikimedia.org/T222244) [16:31:31] mobrovac: storage is happening there [16:31:40] (03PS1) 10Lucas Werkmeister (WMDE): dologmsg: fix variable [puppet] - 10https://gerrit.wikimedia.org/r/511750 [16:32:13] Pchelolo: if mcs couldn't retrieve the html, it couldn't have stored garbage for summary [16:32:19] it == restbase [16:32:35] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [16:33:09] PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:15] 08Warning Alert for device cr1-codfw.wikimedia.org - Outbound interface errors [16:33:23] RECOVERY - Host wtp1035 is UP: PING WARNING - Packet loss = 58%, RTA = 100.78 ms [16:33:32] so, apparently some of your truncates didn't properly propagate.. [16:33:56] mobrovac deleting San_Francisco from latest storage in RB on eqiad fixed summary [16:34:12] mobrovac: I see on https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?contact=all that Icinga sent email to mdholloway and bearND for the mobileapps alerts, but looks like there is something wrong in the configuration re: you [16:34:13] Pchelolo: latest parsoid storage you mean? 
[16:34:18] yes [16:34:37] cdanis: i should have received for both mcs and restbase, but got none [16:34:45] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 101 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:34:52] Pchelolo: ok, i'll re-issue the truncates [16:34:58] please do [16:35:00] in eqiad [16:35:04] yeah, i'm not sure why, but it is not like icinga sent alerts to you and they got lost somehow [16:35:24] why they didn't propagate properly is a separate question to be investigated in followup [16:36:06] 10Operations, 10ops-codfw, 10Analytics, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) [16:36:17] (03CR) 10Jbond: "> Patch Set 1:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [16:36:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:37:00] there we go [16:37:19] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:37:31] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:37:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:37:43] (03PS1) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 [16:37:44] \o/ [16:37:48] ok, that's it, situation back to normal [16:37:51] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:37:59] would someone be willing to write up an incident report? [16:38:13] yeah, we'll have to do that apergos [16:38:14] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-codfw.wikimedia.org recovered from Outbound interface errors [16:38:17] cool [16:38:25] 08Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors [16:38:26] (03CR) 10Jbond: "PCC - https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16666/" [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [16:39:03] the good thing is that this outage was very visible to us and almost not visible for the users [16:39:21] !log rebooting wtp1037-1039 [16:39:22] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:39:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:24] yup, i tried using the app multiple times and couldn't get it to spill an error [16:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:56] it only affected the articles we are using for monitoring and a very small subset of articles that were requested in the couple of minutes the faulty code was deployed [16:40:04] thank goodness [16:40:13] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 459 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:41:00] now the telia circuit comes back too. 
gee thanks [16:44:16] !log rebooting wtp1040-1042 [16:44:17] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:44:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:16] (03CR) 10Muehlenhoff: [C: 04-2] remove a couple 'absent' clauses for users listed as 'ldap-only' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511718 (owner: 10ArielGlenn) [16:46:31] hm. guess that was the wrong approach :-D [16:46:48] !log rebooting phab1003 (non-prod) [16:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] (03Abandoned) 10ArielGlenn: remove a couple 'absent' clauses for users listed as 'ldap-only' [puppet] - 10https://gerrit.wikimedia.org/r/511718 (owner: 10ArielGlenn) [16:48:09] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Kask integration testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) [16:48:32] (03PS1) 10Michael Große: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 [16:49:34] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Kask integration testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) p:05Triage→03Normal It seems that the cassandra subchart already exists for cask (via... 
[16:50:05] !log rebooting wtp1043-1045 [16:50:06] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:50:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:33] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:54:41] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10Volans) I'm assuming that in cases in which the rebase fails because of conflicts or the CI fails after the rebase Jenkins would vote -1 and the patch would be out of the mergin... [16:55:00] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10Volans) p:05Triage→03Normal [16:55:19] !log starting Eqiad LVS re-arrangement shortly - T184293 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/511717 (eqiad front edge is still depooled from public traffic) [16:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:42] !log rebooting wtp1046-1048 [16:55:43] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:55:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] cscott, arlolra, subbu, and halfak: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190521T1700). [17:00:46] 10Operations, 10Wikimedia-Logstash: pybal logs into logstash - https://phabricator.wikimedia.org/T223924 (10fgiunchedi) +1! AFAICS pybal already logs everything to syslog by way of stdout catched by journald, thus the first step should be as easy as adding pybal to `./modules/profile/files/rsyslog/lookup_table... [17:00:52] no parsoid deployment today [17:01:26] (03PS2) 10BBlack: eqiad LVS: All hosts to intended class/primacy [puppet] - 10https://gerrit.wikimedia.org/r/511717 (https://phabricator.wikimedia.org/T184293) [17:02:51] (03CR) 10BBlack: [C: 03+2] eqiad LVS: All hosts to intended class/primacy [puppet] - 10https://gerrit.wikimedia.org/r/511717 (https://phabricator.wikimedia.org/T184293) (owner: 10BBlack) [17:04:35] !log eqiad LVS: high-traffic1 (text): disable pybal on lvs1013 + lvs1001, shifting traffic to lvs1004 [17:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:06] !log eqiad LVS: high-traffic1 (text): puppeting lvs1013, bringing back pybal in primary role, shifting traffic to lvs1013 [17:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:39] (03PS1) 10Dzahn: microsites: comment out git cloning of TransparencyReport-private [puppet] - 10https://gerrit.wikimedia.org/r/511757 [17:07:54] !log eqiad LVS: high-traffic1 (text): puppeting lvs1001, bringing back pybal in backup role, no traffic shift [17:07:55] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [17:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:20] 10Operations, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (10Volans) [17:08:25] PROBLEM - pybal on 
lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:09:21] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:09:26] !log eqiad LVS: high-traffic1 (text): puppeting lvs1004, basically no-op [17:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:49] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:10:11] (03CR) 10Dzahn: [C: 03+2] microsites: comment out git cloning of TransparencyReport-private [puppet] - 10https://gerrit.wikimedia.org/r/511757 (owner: 10Dzahn) [17:10:46] (03PS2) 10Dzahn: microsites: comment out git cloning of TransparencyReport-private [puppet] - 10https://gerrit.wikimedia.org/r/511757 [17:11:49] !log eqiad LVS: high-traffic2 (upload): disable pybal on lvs1014 + lvs1002, shifting traffic to lvs1005 [17:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:48] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, 10User-greg: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Urbanecm) [17:12:59] (03CR) 10Dzahn: "should fix the broken puppet runs on bromine/vega" [puppet] - 10https://gerrit.wikimedia.org/r/511757 (owner: 10Dzahn) [17:13:21] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:13:31] !log eqiad LVS: high-traffic2 (upload): puppeting lvs1014, bringing back pybal in primary role, shifting traffic to lvs1014 [17:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:13] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal 
https://wikitech.wikimedia.org/wiki/PyBal [17:15:29] !log eqiad LVS: high-traffic2 (upload): puppeting lvs1002, bringing back pybal in backup role, no traffic shift [17:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:47] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:16:37] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:17:04] !log eqiad LVS: high-traffic2 (upload): puppeting lvs1005, basically no-op [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:18] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, 10User-greg: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Urbanecm) I hereby confirm authenticity of my SSH key (`ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... [17:18:49] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/Collection/includes/CollectionHooks.php: Fix paths (duration: 00m 56s) [17:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:36] !log eqiad LVS: low-traffic (all internal services): disable pybal on lvs1016 + lvs1015, shifting traffic to lvs1006 [17:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:31] !log eqiad LVS: low-traffic (all internal services): puppeting lvs1015, bringing back pybal in primary role, shifting traffic to lvs1015 [17:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:16] 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10Volans) I think it's fair to add transitions from pretty much any state to the failed state. 
My original thought, that's why they are not there in the current version, was that i... [17:23:45] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [17:24:09] !log eqiad LVS: low-traffic (all internal services): puppeting lvs1006, basically no-op [17:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:33] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:25:39] !log eqiad LVS: low-traffic (all internal services): puppeting lvs1016, bringing back pybal in "secondary" role for all 3 traffic classes (high-traffic1, high-traffic2, low-traffic), no traffic shift expected [17:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:33] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:27:21] ah of course, I forgot part of lvs1016's new stuff :P [17:27:23] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:29:08] (03PS1) 10BBlack: lvs1016: add to high-traffic[12] classes [puppet] - 10https://gerrit.wikimedia.org/r/511759 (https://phabricator.wikimedia.org/T184293) [17:30:15] (03CR) 10BBlack: [C: 03+2] lvs1016: add to high-traffic[12] classes [puppet] - 10https://gerrit.wikimedia.org/r/511759 (https://phabricator.wikimedia.org/T184293) (owner: 10BBlack) [17:31:37] !log eqiad LVS: low-traffic (all internal services): puppeting lvs1016, bringing back pybal in "secondary" role for all 3 traffic classes (high-traffic1, high-traffic2, low-traffic), no traffic shift expected (again, after merging last-minute fixup https://gerrit.wikimedia.org/r/c/operations/puppet/+/511759 ) [17:31:41] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:50] apergos, Pchelolo: incident report v1 @ https://wikitech.wikimedia.org/wiki/Incident_documentation/20190521-RESTBase [17:34:02] k mobrovac [17:34:15] I'll review and update in the morning [17:34:22] sure sure, no hurry [17:34:28] thanks folks [17:34:32] i just wanted to get the timeline in while it's still fresh [17:48:28] (03CR) 10Jbond: "PCC1 - https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16649/console" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [17:49:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Iflorez) [17:51:27] !log add BGP sessions to AS202053 in esams [17:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:13] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10BBlack) **Current status of transition:** New hosts: lvs1013 is primary for high-traffic1 lvs1014 is primary for high-traffic2 lvs1015 is primary for low-traffic lvs1016... 
[17:59:16] (03PS2) 10Andrew Bogott: glance: remove rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/511743 [17:59:17] !log rebooting lvs1016 in attempt to clear interface config issues - T224027 [17:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] T224027: LVS interface settings from /e/n/i not consistently applied on first boots - https://phabricator.wikimedia.org/T224027 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190521T1800) [18:00:12] (03CR) 10Andrew Bogott: [C: 03+2] glance: remove rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/511743 (owner: 10Andrew Bogott) [18:03:53] 10Operations, 10Traffic: LVS interface settings from /e/n/i not consistently applied on first boots - https://phabricator.wikimedia.org/T224027 (10BBlack) FWIW, lvs1016 came back with correct settings after the single additional reboot above. [18:06:46] !log restarting rabbitmq-server on cloudcontrol1003 (turning on HA queues) [18:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:10] (03PS1) 10Jbond: debdeploy client: update defaults [puppet] - 10https://gerrit.wikimedia.org/r/511760 [18:13:31] (03PS5) 10Andrew Bogott: nova: support primary/secondary rabbitmq hosts [puppet] - 10https://gerrit.wikimedia.org/r/511727 (https://phabricator.wikimedia.org/T223906) [18:14:39] (03CR) 10Andrew Bogott: [C: 03+2] nova: support primary/secondary rabbitmq hosts [puppet] - 10https://gerrit.wikimedia.org/r/511727 (https://phabricator.wikimedia.org/T223906) (owner: 10Andrew Bogott) [18:16:38] (03PS2) 10Andrew Bogott: designate: support primary/secondary rabbitmq hosts [puppet] - 10https://gerrit.wikimedia.org/r/511745 (https://phabricator.wikimedia.org/T223906) [18:16:40] (03PS2) 10Andrew Bogott: neutron: support primary/secondary rabbitmq hosts [puppet] - 10https://gerrit.wikimedia.org/r/511744 
(https://phabricator.wikimedia.org/T223906) [18:24:47] (03CR) 10Dzahn: "Arturo: please note you weren't added by a person but by reviewer-bot" [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [18:26:16] (03CR) 10BBlack: [C: 03+1] "Seems like a great idea to me!" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [18:27:20] (03CR) 10Andrew Bogott: [C: 03+2] designate: support primary/secondary rabbitmq hosts [puppet] - 10https://gerrit.wikimedia.org/r/511745 (https://phabricator.wikimedia.org/T223906) (owner: 10Andrew Bogott) [18:34:48] (03PS1) 10BBlack: Revert "Depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/511762 [18:34:55] (03PS2) 10BBlack: Revert "Depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/511762 [18:35:59] !log update lvs static routes on cr1/2-eqiad - T184293 [18:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:29] !log re-pooling eqiad front edge traffic (onto new LVSes from T184293 ) [18:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:42] (03CR) 10BBlack: [C: 03+2] Revert "Depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/511762 (owner: 10BBlack) [18:42:36] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10BBlack) It's been up for ~15 days now without incident, but depooled. Re-pooling it today to see if we can get a recurrence or not. [18:46:15] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10BBlack) Nevermind, apparently it was already repooled, looking at the wrong thing here... 
[18:49:59] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.06 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:51:44] ^ expected, as a ton of traffic was shifted back to eqiad, from codfw [18:52:02] thanks for confirming [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190521T1900) [19:01:05] since train happened during the EU window today, I'm taking this time to do a small gerrit upgrade (as long as mutante and paladox are around) [19:02:19] * paladox is around :) [19:02:31] is around [19:02:48] cool :) [19:02:57] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) [19:03:25] (03CR) 10Thcipriani: [V: 03+2] Gerrit v2.15.13 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/511735 (owner: 10Thcipriani) [19:04:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Iflorez) Updated the ticket to include a capitalized I in the Wikitech username. The correct Wikitec... 
[19:06:15] (03PS1) 10Dzahn: transparency report: remove git cloning of private site [puppet] - 10https://gerrit.wikimedia.org/r/511763 [19:06:17] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2de9001]: Gerrit to 2.15.13 (gerrit 2001 only) [19:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:29] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2de9001]: Gerrit to 2.15.13 (gerrit 2001 only) (duration: 00m 11s) [19:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:09] deploy looks ok on gerrit2001, doing cobalt. [19:07:24] :) [19:08:04] (03CR) 10Jbond: "PCC - https://puppet-compiler.wmflabs.org/compiler1002/16667/stat1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/511760 (owner: 10Jbond) [19:08:25] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2de9001]: Gerrit to 2.15.13 (cobalt, restart incoming) [19:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:45] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2de9001]: Gerrit to 2.15.13 (cobalt, restart incoming) (duration: 00m 20s) [19:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:00] (03PS2) 10Jbond: debdeploy client: update defaults [puppet] - 10https://gerrit.wikimedia.org/r/511760 [19:09:09] !log restart gerrit for 2.15.13 update [19:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:25] (03CR) 10jerkins-bot: [V: 04-1] debdeploy client: update defaults [puppet] - 10https://gerrit.wikimedia.org/r/511760 (owner: 10Jbond) [19:11:46] thcipriani or paladox: i see gerrit as down, is that expected? asking as i see a finished deploy message 
[19:11:54] jbond42 it's expected :) [19:11:59] (upgrading to 2.15.13) [19:12:08] ok thanks, i'll be more patient :) [19:12:40] !log gerrit back on 2.15.13 [19:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:47] jbond42: should be back now [19:13:04] it is thanks [19:13:55] huh, my session didn't get expired this time (yet) [19:14:21] yea. my experience is that sometimes it does and sometimes not [19:14:24] * jbond42 has stopped trying to work out that gremlin [19:14:24] did you click the "remember me" option? [19:14:58] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/511760 (owner: 10Jbond) [19:15:36] I always check 'remember me' [19:15:39] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [19:15:47] I filed https://phabricator.wikimedia.org/T222472 for gerrit session expiration. I'm not sure why that is happening for some folks sometimes. The h2 database on disk seems appropriately sized in terms of the number of users we have. [19:15:48] but some weeks I'm logging in every day [19:15:54] others 2 weeks go by and it's fine [19:16:01] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [19:16:31] subscribed, thanks [19:16:46] i will use cumin to fix those quicker than they self-fix [19:17:10] e.g. 'R:git::clone' 'run-puppet-agent -q --failed-only' [19:18:08] apergos: confirmed what you see .. 
i wasn't really sure it's that due to a change on our side [19:21:03] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:21:25] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:23:43] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.73 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:25:56] (03PS2) 10Dzahn: transparency report: remove git cloning of private site [puppet] - 10https://gerrit.wikimedia.org/r/511763 [19:27:26] (03CR) 10Dzahn: [C: 03+2] transparency report: remove git cloning of private site [puppet] - 10https://gerrit.wikimedia.org/r/511763 (owner: 10Dzahn) [19:30:33] 10Operations, 10observability, 10puppet-compiler, 10User-herron: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10hashar) [19:30:38] uh, which thing? about random session expiries or...? [19:30:40] 10Operations, 10puppet-compiler, 10Jenkins, 10Patch-For-Review: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10hashar) [19:30:42] ('random') [19:30:51] 10Operations, 10puppet-compiler, 10Jenkins, 10Patch-For-Review: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10hashar) 05Open→03Resolved a:03hashar Follow up is T222075 [19:31:12] 10Operations, 10observability, 10puppet-compiler, 10User-herron: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10hashar) [19:33:11] apergos sessions are stored in a h2 db. Some ideas on the cause are corruption or your account hitting the expiry date. 
[19:33:25] oh sorry, that was to mutante [19:33:35] who pinged me but i wasn't quite sure of the context [19:33:42] ok [19:34:15] apergos: i was talking about the "session gets expired or not when gerrit restarts" [19:34:29] ok, I did guess correctly. ty [19:35:03] paladox: there are some weeks where I'm logging in almost every day, so that would rule out the expiry date, I'd think [19:36:47] actually [19:37:42] account expiry would explain this i guess. Looking at the class [19:38:34] https://github.com/GerritCodeReview/gerrit/blob/master/java/com/google/gerrit/server/cache/h2/H2CacheImpl.java#L394 [19:41:47] it would? how would it be ok some weeks and not others? (it might well be as you say, I just have trouble wrapping my head around it) [19:42:06] we should have a 90-day expiration though [19:43:36] !log removing two files for legal compliance [19:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:40] so it should be doing the h2 db query to find the date created for the session and check if it's older than 90 days (seems like the right thing to do anyway, haven't verified that this is *exactly* what the code is doing) [19:45:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Dzahn) Thanks @Iflorez for attention to detail. So the full story is that's all one LDAP user but ther... [19:46:35] It's the only explanation for why it happens to some users and not others. 
[19:47:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Dzahn) a:05Dzahn→03Nuria [19:54:06] (03PS1) 10Jbond: pybal: remove redundent hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [19:54:08] (03PS1) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 [19:55:19] (03PS2) 10Jbond: pybal: remove redundent hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [19:56:12] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/511766 (owner: 10Jbond) [19:56:34] jbond42: s/redundent/redundant/ ;) [19:56:43] (03PS2) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 [19:59:07] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/511766 (owner: 10Jbond) [20:03:41] thanks volans :) [20:04:21] (03PS3) 10Jbond: pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [20:04:54] (03PS3) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 [20:05:05] RECOVERY - MariaDB Slave Lag: s1 on db2103 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:06:16] thcipriani: i have talked to paladox about the "are we really going to off of mysql/mariadb from gerrit 2.16" part. i said i would probably do that separate like a week after the actual upgrade since it is not mandatory to drop it in the moment of the upgrade. though of course it will be cool to not be dependent on it anymore and that should somewhat unblock HA/failover setup in codfw.. does [20:06:22] that sound right to you? 
[20:06:57] !log removing (another) two files for legal compliance [20:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:10] mutante: yes, that sounds right to me. Will want to upgrade isolated pieces as much as possible to make sure performance characteristics stay...good-ish :) [20:07:30] ok :) [20:07:45] and failing "good-ish" we'll know which piece caused the problem [20:10:43] right, that's what i was thinking too [20:15:09] mysql will be deprecated? [20:17:55] chaomodus gerrit has dropped the db support completely from gerrit 3.0+ [20:17:57] chaomodus: yes. there are 2 parts to that question though that are unrelated [20:18:11] so it'll do everything in gitL [20:18:13] ? [20:18:25] chaomodus: a) gerrit will not need mysql anymore b) we want to delete the mysql module from puppet, replaced by mariadb [20:18:36] (03PS1) 10Jbond: wmcs openstack: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511769 [20:18:48] chaomodus yes. [20:18:55] and https://www.h2database.com/ ? [20:19:04] ahh neat [20:19:10] mutante we will only use h2 for the one table [20:19:20] we will drop using h2 when we move to gerrit 3.0 [20:19:20] (03CR) 10jerkins-bot: [V: 04-1] wmcs openstack: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511769 (owner: 10Jbond) [20:19:32] paladox: are there also plans to remove h2... you answered it :) [20:19:51] mutante well h2 will be used for the cache files (as has always been done) [20:20:02] !log disabling puppet on all servers using class memcached (57) [20:20:03] just we won't be falling back to h2 for [database] for the final table [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] paladox: ok! [20:20:18] (03PS4) 10Jbond: pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [20:20:32] by it going pure notedb, it will make backups stable. 
[20:20:50] (03CR) 10jerkins-bot: [V: 04-1] pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 (owner: 10Jbond) [20:21:12] (03PS4) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 [20:21:19] (03PS4) 10Dzahn: memcached: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) [20:22:20] (03PS2) 10Jbond: wmcs openstack: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511769 [20:22:37] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/511769 (owner: 10Jbond) [20:22:43] (03CR) 10Dzahn: "thanks for reviews! disabled puppet with cumin on all 57 hosts using memcached class (not just mc*).. doing a codfw first" [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [20:23:09] (03CR) 10Dzahn: [C: 03+2] memcached: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [20:24:39] jijiki: i'm deploying the mc change [20:26:17] (03PS1) 10Jbond: wmcs openstack: re-add striker::uwsgi::secret_config config [puppet] - 10https://gerrit.wikimedia.org/r/511771 [20:26:48] (03PS2) 10Jbond: wmcs openstack: re-add striker::uwsgi::secret_config config [puppet] - 10https://gerrit.wikimedia.org/r/511771 [20:27:03] (03PS3) 10Jbond: wmcs openstack: re-add striker::uwsgi::secret_config config [puppet] - 10https://gerrit.wikimedia.org/r/511771 [20:27:42] (03CR) 10jerkins-bot: [V: 04-1] wmcs openstack: re-add striker::uwsgi::secret_config config [puppet] - 10https://gerrit.wikimedia.org/r/511771 (owner: 10Jbond) [20:28:27] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/511769 (owner: 10Jbond) [20:31:09] !log mc2019 - stopping memcached and letting puppet restart it to confirm no issues after switching to 
systemd::service [20:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:54] 10Operations, 10ops-codfw, 10Analytics, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) [20:36:44] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10Papaul) [20:37:33] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10herron) [20:37:46] (03PS5) 10Jbond: pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [20:37:54] 10Operations, 10Wikimedia-Logstash, 10User-herron: [stretch] Implement sensitive log access control, onboard 3 sensitive log producers - https://phabricator.wikimedia.org/T213902 (10herron) [20:38:00] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10herron) [20:38:15] 10Operations, 10Wikimedia-Logstash, 10User-herron: Implement sensitive log access control - https://phabricator.wikimedia.org/T213902 (10herron) [20:38:21] 10Operations, 10Wikimedia-Logstash, 10User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (10herron) [20:38:54] (03PS5) 10Jbond: pybaltest: re-add tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/511767 [20:39:17] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Decommission rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) [20:43:08] !log re-enabling puppet on all hosts using memcached class - except mc1* [20:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:32] 
10Operations, 10ops-codfw, 10Analytics, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) @Ottomata can you please provide me with the partman recipe for those systems and for the RAID config do you want user 256KB for RAID Stripe size? [20:49:22] (03PS1) 10Herron: WIP: add multi-instance support to kibana [puppet] - 10https://gerrit.wikimedia.org/r/511772 (https://phabricator.wikimedia.org/T213902) [20:49:53] (03CR) 10jerkins-bot: [V: 04-1] WIP: add multi-instance support to kibana [puppet] - 10https://gerrit.wikimedia.org/r/511772 (https://phabricator.wikimedia.org/T213902) (owner: 10Herron) [20:51:06] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, 10User-greg: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10RStallman-legalteam) Updated NDA for Urbanecm is fully signed and on file. [20:51:42] bstorm_: the labstore1004/1005 having puppet disabled / zero resources tracked is known ? [20:51:54] That is not known! [20:51:57] Looking at it [20:52:45] ah :) cool [20:52:59] Ah. There's no notes_url on an old bit of monitoring definition [20:53:00] (03PS2) 10Herron: WIP: add multi-instance support to kibana [puppet] - 10https://gerrit.wikimedia.org/r/511772 (https://phabricator.wikimedia.org/T213902) [20:53:27] bstorm_: i meant "CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle." 
[20:53:31] (03CR) 10jerkins-bot: [V: 04-1] WIP: add multi-instance support to kibana [puppet] - 10https://gerrit.wikimedia.org/r/511772 (https://phabricator.wikimedia.org/T213902) (owner: 10Herron) [20:53:37] let me know where to add notes_urls though [20:53:44] i want to get them all done :) [20:54:06] fwiw mutante that one in particular has been popping up (and then going away) on a wide variety of hosts for the past few weeks [20:54:30] (03PS3) 10Herron: WIP: add multi-instance support to kibana [puppet] - 10https://gerrit.wikimedia.org/r/511772 (https://phabricator.wikimedia.org/T213902) [20:54:33] cdanis: hmm.. ACK! though this one seems to be persistent since 5 days and the others were transient? [20:54:40] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10aaron) They preferably would happen in the main DC vi... [20:54:43] ah ok, did not realize it was persistent [20:56:38] mutante: the problem is that it has a notes_url line now for a module that doesn't have the parameter apparently [20:56:57] Nrpe::Monitor_systemd_unit_state[drbd] [20:58:25] Looks like something changed that this code wasn't prepared for [20:58:27] bstorm_: uh.. interesting, let me check that [20:58:31] could be me [20:59:11] nrpe::monitor_service is what has it [20:59:39] this seems like a special class that might as well use the generic one [21:01:02] but this was fine: https://gerrit.wikimedia.org/r/c/operations/puppet/+/509553/2/modules/nrpe/manifests/monitor_systemd_unit_state.pp [21:02:04] i'll fix it one way or another .. looking closer [21:09:23] yea, i broke it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/509545/2/modules/labstore/manifests/monitoring/primary.pp .. 
one class is not like the others [21:10:00] though we can replace that with the one that does the same and is used in base.. probably [21:11:13] or not.. a bunch of systems using that. adding the parameter instead [21:13:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Nuria) We need to legal to please confirm NDA status plus also an expiration date for access for @Iflo... [21:14:59] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 137.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [21:21:25] (03PS1) 10Dzahn: nrpe: add notes_url to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/511779 (https://phabricator.wikimedia.org/T197873) [21:21:47] (03CR) 10jerkins-bot: [V: 04-1] nrpe: add notes_url to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/511779 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:21:47] !log re-enabling puppet on mc1* hosts [21:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10RStallman-legalteam) Signed NDA confirmed. Contract is through May 31, 2020. [21:23:33] (03CR) 10Dzahn: "carefully re-enabled puppet in steps. it was noop on all i checked" [puppet] - 10https://gerrit.wikimedia.org/r/506559 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:23:48] (03CR) 10Hashar: "Dzahn the Jenkins builds are automatically deleted after some days (7, 15 or 30 days depending on the job traffic)." 
[puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:24:36] (03CR) 10Dzahn: [C: 04-1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:24:46] (03CR) 10Dzahn: "@paladox" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:24:49] (03CR) 10jerkins-bot: [V: 04-1] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:25:20] paladox: wanna rebase^? [21:25:25] yup [21:25:30] (03PS2) 10Awight: Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [21:26:23] (03PS15) 10Paladox: zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [21:27:02] (03CR) 10jerkins-bot: [V: 04-1] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:27:33] (03CR) 10Hashar: "As for the change proposed: I don't think we need to setup all those .py aliases. 
We can just run them from the /srv/deployment/x/y/z path" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [21:27:45] 10Operations, 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, and 4 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Ladsgroup) [21:27:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Ladsgroup) [21:28:19] (03CR) 10Awight: [C: 03+1] Point beta commonswiki to local CommonsHelper config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511678 (https://phabricator.wikimedia.org/T223379) (owner: 10Awight) [21:28:48] (03CR) 10Awight: [C: 03+1] Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [21:28:59] (03PS3) 10Awight: Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [21:29:07] (03CR) 10Awight: [C: 03+1] Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [21:29:53] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Ladsgroup) [21:30:49] 10Operations, 10Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (10Ladsgroup) [21:31:09] 10Operations, 10Scap, 10Wikimedia-Incident: Update Debian Package for Scap3 to 3.8.3-1 - https://phabricator.wikimedia.org/T198277 (10Ladsgroup) [21:31:21] (03PS2) 10Dzahn: nrpe: add notes_url to 
monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/511779 (https://phabricator.wikimedia.org/T197873) [21:38:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Dzahn) Updating patch to include expiry_date May 31, 2020. Who should be expiry_contact? Nuria? [21:44:36] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16673/" [puppet] - 10https://gerrit.wikimedia.org/r/511779 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:46:28] bstorm_: puppet fixed on labstore1004/1005. a bunch of stuff got applied that happened in puppet in the last couple days [21:47:07] incl. nfs-manage-binds [21:47:09] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:47:12] ^ [21:47:51] the bonus part: it now has a default URL that is used unless overridden by a more specific one [21:47:57] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:50:14] (03PS2) 10Dzahn: gerrit: don't include nrpe in jetty [puppet] - 10https://gerrit.wikimedia.org/r/511615 [21:51:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16674/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/511615 (owner: 10Dzahn) [21:51:48] mutante: thanks! 
[21:52:08] yw, i broke it :P [21:54:20] (03PS2) 10Dzahn: swift: stop including ::nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/511617 [22:00:09] 10Operations, 10ops-eqiad, 10cloud-services-team: cloudvirt1028 - no PS redundancy - https://phabricator.wikimedia.org/T224065 (10Dzahn) [22:00:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cloudvirt1028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn https://phabricator.wikimedia.org/T224065 [22:01:37] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16675/" [puppet] - 10https://gerrit.wikimedia.org/r/511617 (owner: 10Dzahn) [22:06:25] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10Ladsgroup) [22:06:36] (03PS1) 10Bstorm: wikilabels: add secondary db role for replication [puppet] - 10https://gerrit.wikimedia.org/r/511786 (https://phabricator.wikimedia.org/T224062) [22:08:46] (03CR) 10Bstorm: [C: 03+2] wikilabels: add secondary db role for replication [puppet] - 10https://gerrit.wikimedia.org/r/511786 (https://phabricator.wikimedia.org/T224062) (owner: 10Bstorm) [22:17:04] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Tgr) [22:25:55] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Krenair) Does the Foundation have an NDA with modular.im? 
[22:26:47] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, 10User-greg: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Dzahn) [22:31:40] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Dzahn) The comments on T215042#4977385 sounded like this wasn't going to be done, for the temporary evaluation that... [22:35:41] (03PS4) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [22:40:55] (03PS2) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 [22:43:45] (03CR) 10Dzahn: [C: 03+2] vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:44:57] (03CR) 10Dzahn: "compiler looks good https://puppet-compiler.wmflabs.org/compiler1001/16676/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [22:48:54] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:51:43] !log decommissioning restbase1007-b -- T223976 [22:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:49] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [22:55:58] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational [22:56:03] !log ms-be2034 - sudo systemctl reset-failed [22:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:37] !log ms-be2034 - degraded systemd state was cleared and originally caused by " failed Session 72587 of user debmonitor" [22:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:14] (03PS3) 10Dzahn: mediawiki::cgroup: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448778 (https://phabricator.wikimedia.org/T194724) [23:00:00] (03PS4) 10Alex Monk: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [23:00:04] MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190521T2300). [23:00:05] awight: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:16] Thank you, jouncebot [23:02:06] (03CR) 10Alex Monk: "Idce6179e conflicted due to changing deleted file, rebased and beta's puppet repo is better again" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [23:09:05] 10Operations, 10Growth-Team, 10MediaWiki-Cache, 10Performance-Team, and 5 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) [23:09:12] 10Operations, 10Growth-Team, 10MediaWiki-Cache, 10Performance-Team, and 5 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) [23:09:25] 10Operations, 10Growth-Team, 10Performance-Team, 10Wikidata, and 4 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) [23:10:03] Guess I can deploy [23:11:17] (03PS2) 10MaxSem: Point beta commonswiki to local CommonsHelper config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511678 (https://phabricator.wikimedia.org/T223379) (owner: 10Awight) [23:11:22] (03CR) 10MaxSem: [C: 03+2] Point beta commonswiki to local CommonsHelper config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511678 (https://phabricator.wikimedia.org/T223379) (owner: 10Awight) [23:12:01] MaxSem: Thanks! 
[23:12:27] (03Merged) 10jenkins-bot: Point beta commonswiki to local CommonsHelper config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511678 (https://phabricator.wikimedia.org/T223379) (owner: 10Awight) [23:12:41] (03CR) 10jenkins-bot: Point beta commonswiki to local CommonsHelper config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511678 (https://phabricator.wikimedia.org/T223379) (owner: 10Awight) [23:15:22] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 100.6 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [23:15:45] MaxSem: Confirmed that the config took effect and did more or less what I wanted. [23:17:13] awight: I'm confused about https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/511667/ - the variable still exists in MW and is not deprecated [23:17:22] looking... [23:18:11] I see that it's still set in mw-core, but not read. [23:19:42] https://gerrit.wikimedia.org/g/mediawiki/core/+/60fbf0d875c27dc6608c97ce1f96e1df3d88b788/includes/diff/DifferenceEngine.php#1175 [23:20:22] It's still read in two places https://codesearch.wmflabs.org/search/?q=WikiDiff2MovedParagraphDetectionCutoff&i=nope&files=&repos= [23:20:38] However, as I understand it, the currently deployed version of wikidiff2 (in C) ignores the parameter. [23:21:20] I don't know whether 25 vs 0 itself might somehow change behaviour even if its actual value is ignored. [23:22:33] (03CR) 10Krinkle: "Thanks Alex" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [23:23:14] I don't like how... wrong I was. But the usages do seem fairly safe. Worst case might be that some cached diffs are invalidated. 
[23:24:41] Hmm, the extension doesn't use that parameter [23:25:07] https://github.com/wikimedia/mediawiki-php-wikidiff2/blob/master/php_wikidiff2.cpp#L95 [23:25:12] I guess the worst case might actually be that *all* cached diffs are invalidated. [23:25:16] And 1.8.1 is deployed [23:26:44] OK, I'll deploy but please deprecate the variable in MW [23:27:12] (03CR) 10Krinkle: "Compiler: https://puppet-compiler.wmflabs.org/compiler1001/16677/" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [23:27:15] Definitely. [23:28:01] (03PS3) 10MaxSem: Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:28:08] (03CR) 10MaxSem: [C: 03+2] Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:29:08] The remaining cleanup is on our list for T194272--maybe it would have been better to remove the usages first, of course. [23:29:09] T194272: Clean up config variable handling - https://phabricator.wikimedia.org/T194272 [23:29:11] (03Merged) 10jenkins-bot: Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:29:26] (03CR) 10jenkins-bot: Remove deprecated wikidiff2 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511667 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:29:54] (03CR) 10Dzahn: "whatever the real issues is but see https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/13696/console" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [23:30:11] awight: pulled on mwdebug1002, please test [23:32:18] MaxSem: diffs still work there [23:32:34] Do they work... 
well? :P [23:33:01] hehe [23:33:17] They look as crappy as ever. [23:34:16] !log maxsem@deploy1001 Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/511667/ (duration: 00m 56s) [23:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:49] Let's wait a bit, meanwhile... [23:34:57] (03PS4) 10MaxSem: Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:42:14] (03PS2) 10Dzahn: monitoring: add Icinga notes_url for unmerged changes check [puppet] - 10https://gerrit.wikimedia.org/r/510963 [23:42:25] No noticeable difference in error levels, let's continue [23:42:35] (03CR) 10MaxSem: [C: 03+2] Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:42:49] (03PS3) 10Dzahn: monitoring: add Icinga notes_url for unmerged changes check [puppet] - 10https://gerrit.wikimedia.org/r/510963 (https://phabricator.wikimedia.org/T197873) [23:43:32] (03Merged) 10jenkins-bot: Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:43:48] (03CR) 10jenkins-bot: Remove deprecated wikidiff2 $wmg variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511668 (https://phabricator.wikimedia.org/T194272) (owner: 10WMDE-Fisch) [23:44:25] awight: pulled on mwdebug1002, please test [23:44:31] sure [23:44:42] (03CR) 10Dzahn: [C: 03+2] monitoring: add Icinga notes_url for unmerged changes check [puppet] - 10https://gerrit.wikimedia.org/r/510963 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:45:02] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs 
- https://phabricator.wikimedia.org/T223835 (10Tgr) >>! In T223835#5202604, @Krenair wrote: > Does the Foundation have an NDA with modular.im? NDA for what? This... [23:45:05] (03PS4) 10Dzahn: monitoring: add Icinga notes_url for unmerged changes check [puppet] - 10https://gerrit.wikimedia.org/r/510963 (https://phabricator.wikimedia.org/T197873) [23:46:01] MaxSem: Verified wikidiff2'ing [23:47:57] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/511668/ (duration: 00m 57s) [23:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:10] awight: ^ [23:48:25] Cheers! [23:55:01] (03PS1) 10Dzahn: delete the cgred module [puppet] - 10https://gerrit.wikimedia.org/r/511791 (https://phabricator.wikimedia.org/T194724)