[00:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:02:49] (PS1) Dzahn: gerrit: also monitor gerrit process on replicas [puppet] - https://gerrit.wikimedia.org/r/548556
[00:02:51] (CR) Paladox: "We may have an issue because we require review_site in java.pp but depend on it before jetty which defines it." [puppet] - https://gerrit.wikimedia.org/r/548554 (owner: Dzahn)
[00:03:56] (CR) jerkins-bot: [V: -1] gerrit: also monitor gerrit process on replicas [puppet] - https://gerrit.wikimedia.org/r/548556 (owner: Dzahn)
[00:08:36] (PS2) Dzahn: gerrit: refactor, move java setup to separate class [puppet] - https://gerrit.wikimedia.org/r/548554
[00:09:47] (CR) jerkins-bot: [V: -1] gerrit: refactor, move java setup to separate class [puppet] - https://gerrit.wikimedia.org/r/548554 (owner: Dzahn)
[00:09:47] (PS2) Dzahn: gerrit: also monitor gerrit process on replicas [puppet] - https://gerrit.wikimedia.org/r/548556
[00:12:14] (CR) Paladox: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/548556 (owner: Dzahn)
[00:12:33] (CR) Paladox: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/548545 (owner: Dzahn)
[00:13:02] !log gerrit2001 - restart gerrit (replica)
[00:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:26] (CR) Dzahn: [C: +2] gerrit: also monitor gerrit process on replicas [puppet] - https://gerrit.wikimedia.org/r/548556 (owner: Dzahn)
[00:15:00] jouncebot: next
[00:15:00] In 11 hour(s) and 44 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T1200)
[00:15:43] !log gerrit - restarting service to re-enable jgit gc (T217497)
[00:15:47] PROBLEM - Check
the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[00:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:48] T217497: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497
[00:16:07] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:18:57] (CR) Dzahn: [C: +2] gerrit: link /etc/gerrit to /var/lib/gerrit2/review_site/etc [puppet] - https://gerrit.wikimedia.org/r/548545 (owner: Dzahn)
[00:19:05] (PS3) Dzahn: gerrit: link /etc/gerrit to /var/lib/gerrit2/review_site/etc [puppet] - https://gerrit.wikimedia.org/r/548545
[00:21:41] (PS3) Dzahn: gerrit: ensure tmp dir under review_site exists [puppet] - https://gerrit.wikimedia.org/r/548550 (https://phabricator.wikimedia.org/T176774)
[00:23:15] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/19239/gerrit1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/548550 (https://phabricator.wikimedia.org/T176774) (owner: Dzahn)
[00:24:15] (CR) Dzahn: [C: -2] "one more jessie Gerrit in cloud needs to be upgraded / replaced" [puppet] - https://gerrit.wikimedia.org/r/548547 (owner: Dzahn)
[00:26:42] (PS1) Dzahn: gerrit: keep the mode on the tmp dir 0700 [puppet] - https://gerrit.wikimedia.org/r/548566
[00:27:09] (PS2) Catrope: GrowthExperiments: Enable suggested edits, but as opt-in only [mediawiki-config] - https://gerrit.wikimedia.org/r/547856 (https://phabricator.wikimedia.org/T236968)
[00:27:33] (CR) Dzahn: [C: +2] gerrit: keep the mode on the tmp dir 0700 [puppet] - https://gerrit.wikimedia.org/r/548566 (owner: Dzahn)
[00:28:25] (CR) Dzahn: "previous puppet run: Notice:
/Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/tmp]/mode: mode changed '0700' to '0775'" [puppet] - https://gerrit.wikimedia.org/r/548566 (owner: Dzahn)
[00:53:44] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:55:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:25] (PS1) Dmaza: Enable SpecialMute page on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577)
[01:04:13] (CR) jerkins-bot: [V: -1] Enable SpecialMute page on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) (owner: Dmaza)
[01:06:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:14:56] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:32:48] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:49:38] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:50:26] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:12:04] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:23:18] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:41:06] (PS1) Ammarpad: Rename DPL extension global to non-ambiguous name [mediawiki-config] - https://gerrit.wikimedia.org/r/548569
[02:59:48] !log depool cp3057 - T237348
[02:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:54] T237348: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348
[03:01:27] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3057.esams.wmnet
[03:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:40] RECOVERY - Host cp3057 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms
[03:20:15] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:23] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:23:51] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:24:16] PROBLEM - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:24:24] woop
[03:24:30] checking
[03:24:39] thx!
[03:24:58] let me know if there is a need for an IC
[03:26:00] wdqs1004 and 1005 are having issues :/
[03:26:20] I wonder why icinga isn't reporting anything there
[03:27:19] following along, here if you need any inexperienced help :)
[03:28:10] RECOVERY - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 475 bytes in 1.013 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:28:42] oh nice.. the icinga check goes to UNKNOWN instead of CRITICAL if it gets a timeout on wdqs
[03:28:44] here. wdqs1004 does show 2 checks in UNKNOWN
[03:28:48] yeah...
[03:29:10] restart blazegraph service ?
[03:29:16] making some guesses here:
[03:29:18] well. it recovered already
[03:29:38] all i know if it was wdqs in the past usually we could restart blazegraph to fix something
[03:29:41] I think this is another case of WDQS update/write load being too high in conjunction with the read rate
[03:30:43] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:30:55] wdqs1004 is still showing the UNKNOWNs on the HTTP port
[03:31:13] and 1005 shows it on the blazagraph port
[03:31:41] unknown means timeout after 10s. could just take longer
[03:31:49] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:34:49] timeout changed to 429 Too Many Requests
[03:35:20] uh?
[03:35:46] on wdqs1004, the HTTP Port check
[03:36:59] if the metrics are to be believed (and it is hard to say since they get scraped from the same HTTP servers that were unresponsive), both 1005 and 1006 were GC thrashing pretty hard for the past half hour
[03:38:30] sigh... I just got a strack trace of the readiness-probe of wdqs1004 from lvs1016
[03:39:28] oh.. and I'm getting 404s on icinga.wikimedia.org/alerts from time to time
[03:39:29] wtf?
[03:40:17] this is more specific wdqs https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=wdqs
[03:41:46] cdanis: am I looking at the right thing? https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=21&fullscreen&orgId=1&from=now-3h&to=now I see thrashing at 1005 in young and oldgen, but not at 1006
[03:42:53] rlazarus: you are right. so many time series...
[03:43:01] tempting to restart wdqs-blazegraph on 1004 (that did it before) but also syslog there shows wdqs-blazegraph is actively handling wikidata pages
[03:43:43] mutante: 1004 isn't handling queries though?
[03:43:46] mutante: hmmm wdqs1004 isn't replying pybal healthchecks... so it's effectively depooled
[03:44:18] WARN: wdqs1004.eqiad.wmnet (enabled/partially up/not pooled) from lvs logs
[03:44:20] http://discovery.wmflabs.org/wdqs/
[03:44:58] cdanis: tail -f /var/log/syslog looked pretty active
[03:45:27] could it still be working through a deep queue?
[03:45:45] if it's GC thrashing that tracks
[03:46:14] !log wdqs1004 restarting wdqs-blazegraph
[03:46:16] 1004 hasn't reported any exported-by-blazegraph metrics since 22:00 UTC
[03:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:30] so.. or it's some internal client hitting wdqs1004...
or it's a deep queue as rlazarus points out
[03:46:53] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=22&fullscreen&orgId=1&from=now-12h&to=now and click on 1004 in the legend
[03:48:11] mutante: it looks like the restart fixed the issue from pybal's point of view
[03:48:25] vgutierrez: ok, cool
[03:48:51] so, 1004 got wedged ~6 hours ago, then 1005 thrashing because of load?
[03:49:28] recovery on the port check in icinga
[03:49:36] and High Update Lag down to WARN
[03:49:49] (CR) Cwhite: "This change is ready for review." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[03:51:41] !log pooling cp3057 - T237348
[03:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:46] T237348: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348
[03:52:40] https://phabricator.wikimedia.org/T213191
[03:53:19] https://phabricator.wikimedia.org/T175919
[03:59:19] well..one more https://phabricator.wikimedia.org/T159245 but out again for now
[04:12:11] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[04:12:22] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp5010 [puppet] - https://gerrit.wikimedia.org/r/548574 (https://phabricator.wikimedia.org/T231627)
[04:12:24] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5010 [puppet] - https://gerrit.wikimedia.org/r/548575 (https://phabricator.wikimedia.org/T231627)
[04:13:33] !log Switch from nginx to ats-tls on cp5010 - T231627
[04:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:49] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[04:14:50] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp5010 [puppet] - https://gerrit.wikimedia.org/r/548574 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:17:05] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp5010 [puppet] - https://gerrit.wikimedia.org/r/548575 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:29:20] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp5011 [puppet] - https://gerrit.wikimedia.org/r/548576 (https://phabricator.wikimedia.org/T231627)
[04:29:22] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5011 [puppet] - https://gerrit.wikimedia.org/r/548577 (https://phabricator.wikimedia.org/T231627)
[04:30:28] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp5011 [puppet] - https://gerrit.wikimedia.org/r/548576 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:30:53] !log Switch from nginx to ats-tls on cp5011 - T231627
[04:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:58] T231627: Move cache text cluster from nginx to ats-tls -
https://phabricator.wikimedia.org/T231627
[04:33:29] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp5011 [puppet] - https://gerrit.wikimedia.org/r/548577 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:35:50] (PS2) Ayounsi: Intial interfaces templates [homer/public] - https://gerrit.wikimedia.org/r/547584
[05:29:46] (PS1) Vgutierrez: prometheus: Tell between upload and text on trafficserver exporters [puppet] - https://gerrit.wikimedia.org/r/548581 (https://phabricator.wikimedia.org/T236482)
[05:32:26] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:34:37] (PS2) Vgutierrez: prometheus: Tell between upload and text on trafficserver exporters [puppet] - https://gerrit.wikimedia.org/r/548581 (https://phabricator.wikimedia.org/T236482)
[06:58:20] (CR) Giuseppe Lavagetto: httpbb: Create a new Puppet module for httpbb. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus)
[07:23:02] (CR) Jcrespo: [C: +2] "/a/sqldata can go too, BTW" [puppet] - https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: Andrew Bogott)
[07:23:12] (PS4) Jcrespo: labweb/cloudweb: fix path for backup jobs [puppet] - https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: Andrew Bogott)
[07:26:31] jynus: hey, for when you have time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/547271
[07:26:38] ok
[07:26:48] after the one I just +2ed
[07:27:59] Thanks!
[07:29:05] (CR) Jcrespo: [C: +1] Revert "mediawiki: Temporary disable rebuildTermIndex" [puppet] - https://gerrit.wikimedia.org/r/547271 (owner: Ladsgroup)
[07:29:14] (PS3) Jcrespo: Revert "mediawiki: Temporary disable rebuildTermIndex" [puppet] - https://gerrit.wikimedia.org/r/547271 (owner: Ladsgroup)
[07:29:54] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:31:30] (CR) Jcrespo: [C: +2] Revert "mediawiki: Temporary disable rebuildTermIndex" [puppet] - https://gerrit.wikimedia.org/r/547271 (owner: Ladsgroup)
[07:31:41] (PS1) Ladsgroup: Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055)
[07:32:26] (CR) jerkins-bot: [V: -1] Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055) (owner: Ladsgroup)
[07:32:59] Amir1: mwmaint1002: Notice: /Stage[main]/Mediawiki::Maintenance::Wikidata/Cron[wikidata-rebuildItemTerms]/ensure: created
[07:33:15] thanks. It will start in 58 minutes
[07:33:24] sorry I missed the window
[07:35:38] (CR) Ladsgroup: "@Alex: Ping!" [puppet] - https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup)
[07:36:27] All good, It was down for days, it can wait for an hour
[07:37:02] we might be slowly increase the speed though. I will do it once I'm back from this conference
[07:37:02] if I can compensate with any other review, let me know
[07:37:30] Sure. Thanks!
[07:39:30] onimisionipe, gehel: o/ - just checking the icinga alerts, the maps OSM syncro lag alerts seem not actionables from what I can see (the dashboard points to a growing trend from days ago), is there anything burning or we can ack?
[07:40:09] (PS2) Ladsgroup: Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055)
[07:44:37] elukey: sorry.. I should have ack that
[07:44:41] there's a task
[07:46:35] (PS2) Elukey: eventlogging: run sanitization script on all the db records [puppet] - https://gerrit.wikimedia.org/r/548318 (https://phabricator.wikimedia.org/T236818)
[07:46:46] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on icinga1001 is CRITICAL: 7.191e+05 ge 2.592e+05 Mathew.onipe See https://phabricator.wikimedia.org/T237228 - The acknowledgement expires at: 2019-11-06 07:46:08. https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[07:47:10] onimisionipe: nono I was just checking, I imagined it was something ackable but I wanted to triple check :) thanks!
[07:48:01] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 7.192e+05 ge 2.592e+05 Mathew.onipe see https://phabricator.wikimedia.org/T237228 - The acknowledgement expires at: 2019-11-06 07:46:52.
https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[07:49:37] (CR) Elukey: [C: +2] eventlogging: run sanitization script on all the db records [puppet] - https://gerrit.wikimedia.org/r/548318 (https://phabricator.wikimedia.org/T236818) (owner: Elukey)
[07:56:04] (PS1) Jcrespo: bacula-director: remove unused /a/sqldata, /mnt/a filesets [puppet] - https://gerrit.wikimedia.org/r/548585 (https://phabricator.wikimedia.org/T229209)
[08:02:06] !log redact mnwwiki on db1124 and db2094 T235743
[08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:11] T235743: Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743
[08:07:01] (CR) Ema: [C: +1] "LGTM, let's see what godog says!" [puppet] - https://gerrit.wikimedia.org/r/548581 (https://phabricator.wikimedia.org/T236482) (owner: Vgutierrez)
[08:12:15] !log uploaded fifo-log-demux 0.6 to apt.wikimedia.org (stretch)
[08:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:58] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:20:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:32] (CR) Alexandros Kosiaris: [C: +1] bacula-director: remove unused /a/sqldata, /mnt/a filesets [puppet] - https://gerrit.wikimedia.org/r/548585 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[08:23:51] there seems to be an email from Telia --^
[08:24:29] not a planned maintenance
[08:25:50] (CR) Alexandros Kosiaris: [C: +2] hiera: update ores to pass statsd through statsd_exporter [puppet] -
https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[08:26:28] RECOVERY - Host cp5012 is UP: PING OK - Packet loss = 0%, RTA = 235.84 ms
[08:26:51] don't be too excited, it just rebooted into the installer ^
[08:27:02] \o/ /o\
[08:29:44] PROBLEM - check_trafficserver_log_fifo_purge_backend on cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:29:48] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[08:29:52] PROBLEM - Confd vcl based reload on cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[08:29:52] PROBLEM - dhclient process on cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:29:52] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:29:52] PROBLEM - Ensure traffic_manager is running for instance tls on cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:29:52] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:30:02] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:30:02] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on
cp5012 is CRITICAL: connect to address 10.132.0.112 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates
[08:30:02] PROBLEM - SSH on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:02] PROBLEM - HTTPS Unified RSA on cp5012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:30:02] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:30:12] PROBLEM - HTTPS Unified ECDSA on cp5012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:30:36] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[08:31:07] sorry
[08:32:20] PROBLEM - Host cp5012 is DOWN: PING CRITICAL - Packet loss = 100%
[08:34:57] ema: I am assuming just cp5012 is down, no actionable needed?
[08:35:22] jynus: correct
[08:35:36] see https://phabricator.wikimedia.org/T237360
[08:36:40] RECOVERY - Host cp5012 is UP: PING OK - Packet loss = 0%, RTA = 235.22 ms
[08:39:20] (CR) Jcrespo: [C: +2] bacula-director: remove unused /a/sqldata, /mnt/a filesets [puppet] - https://gerrit.wikimedia.org/r/548585 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[08:40:44] (CR) Jcrespo: "Notice: /Stage[main]/Bacula::Director/File[/etc/bacula/conf.d/fileset-a-sqldata.conf]/ensure: removed" [puppet] - https://gerrit.wikimedia.org/r/548585 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[08:41:01] s/actionable/urgent action needed/
[08:42:54] RECOVERY - SSH on cp5012 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:47:54] (CR) Volans: [C: -2] "The email check is saying a slightly different thing, that the datacenter aliases, all together, do not match the same set of hosts that a" [puppet] - https://gerrit.wikimedia.org/r/548539 (owner: Dzahn)
[08:52:55] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:55:41] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 92 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[08:56:14] ^akosiaris
[08:56:19] :-)
[08:58:16] \ο/
[08:58:20] nice!
[09:00:04] (CR) Muehlenhoff: [C: +1] "This looks good.
We should add a way to have extensions define what extra packages we need a run time, right now there's too much guess-wo" [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304) (owner: Dzahn)
[09:05:37] (CR) Muehlenhoff: [C: -1] "That's not needed, puppet masters for Cloud VPS are now running in Cloud VPS, so the labtestpuppetmaster should be removed as well (labpup" [puppet] - https://gerrit.wikimedia.org/r/548539 (owner: Dzahn)
[09:07:26] (CR) Muehlenhoff: "Tasks have been filed to remove jessie instances, but that is still ongoing. You could check with the Cumin instance for Cloud VPS whether" [puppet] - https://gerrit.wikimedia.org/r/548439 (owner: Dzahn)
[09:07:28] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/548581 (https://phabricator.wikimedia.org/T236482) (owner: Vgutierrez)
[09:10:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:11:20] !log restart dbprov2001 T236924
[09:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:29] T236924: dbprov2002 slower to generate snapshots - https://phabricator.wikimedia.org/T236924
[09:13:33] (PS2) Elukey: Add analytics users (without ssh keys) to all Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269)
[09:14:51] (CR) Elukey: [C: +2] Add analytics users (without ssh keys) to all Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:15:45] (PS2) Elukey: Add druid/analytics/search system users to all Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/548294 (https://phabricator.wikimedia.org/T237269)
[09:17:48] (CR) Elukey: [C: +2] Add druid/analytics/search system users to all Hadoop worker
nodes [puppet] - https://gerrit.wikimedia.org/r/548294 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:18:34] !log restart dbprov100[12] T236924
[09:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:38] T236924: dbprov2002 slower to generate snapshots - https://phabricator.wikimedia.org/T236924
[09:19:32] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::packages: re-add ploticus for EasyTimeline extension [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304) (owner: Dzahn)
[09:22:25] (CR) Muehlenhoff: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/548294 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:23:00] (CR) Filippo Giunchedi: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/548292 (owner: Filippo Giunchedi)
[09:24:08] !log wb2-phab stopped saying things a while ago. Restarted
[09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:36] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) p:Triage→Normal
[09:25:08] I'm glad you're feeling better, wb2-phab
[09:27:28] one has to wonder how useful this bot is, if it was gone for many hours and nobody noticed?
[09:29:20] good question
[09:30:07] the last time it said something was last EU evening (22:01 UTC)
[09:30:25] so 10+ hours of silence basicalyl
[09:30:33] (CR) Muehlenhoff: [C: -1] rsync: add option to TLS-wrap communications (3 comments) [puppet] - https://gerrit.wikimedia.org/r/547527 (owner: CDanis)
[09:34:03] !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[09:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:06] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:51] (CR) Vgutierrez: [C: +2] prometheus: Tell between upload and text on trafficserver exporters [puppet] - https://gerrit.wikimedia.org/r/548581 (https://phabricator.wikimedia.org/T236482) (owner: Vgutierrez)
[09:40:25] (CR) Volans: wmf_auto_reimage: Adjust message about waiting for puppet (2 comments) [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[09:40:44] (CR) Mobrovac: [C: -1] "Overall LGTM, some comments in-line." (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: Jeena Huneidi)
[09:43:58] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) As an update, cp5012 is currently reimaging (`Started first puppet run` phase). The initramfs looks like this right now: ` root@cp5012:~# ls -l /boot/...
[09:45:43] Operations, serviceops: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Joe)
[09:49:20] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:49:43] Operations, Puppet, DBA, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (Joe) As far as etcd is concerned, a rolling restart should be enough to ensure the new CA is picked up. I will take care of that.
[09:51:00] Operations, Puppet, serviceops, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (Joe)
[09:52:09] Operations, Puppet, serviceops, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (Joe)
[09:54:29] (PS2) Jbond: puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362)
[09:54:53] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` and were **ALL** successful.
[09:55:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:55:23] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: remove test_servers logic [puppet] - 10https://gerrit.wikimedia.org/r/541830 (https://phabricator.wikimedia.org/T235077) (owner: 10Jbond) [09:56:23] 10Operations, 10Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (10ema) This time, after reimaging the host it did boot properly. Also, initramfs size is now in line with that of other cp5 systems: ` cp5010.eqsin.wmnet: -r... [09:59:39] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) >>! In T237259#5634722, @Joe wrote: > As far as etcd is concerned, a rolling restart should be enough to ensure the new CA is picked up. I will take care of that... 
[10:00:04] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [10:02:18] (03PS1) 10Vgutierrez: prometheus: Switch trafficserver configuration exporters to cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) [10:04:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Switch trafficserver configuration exporters to cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [10:08:01] (03PS2) 10Vgutierrez: prometheus: Switch trafficserver configuration exporters to cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) [10:13:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [10:14:08] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Switch trafficserver configuration exporters to cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [10:14:12] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: add dashboard links to grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/548292 (owner: 10Filippo Giunchedi) [10:15:20] (03PS3) 10Vgutierrez: prometheus: Switch trafficserver configuration exporters to cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/548702 (https://phabricator.wikimedia.org/T236482) [10:16:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I got some ugly metric names in prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [10:16:39] (03PS1) 10Alexandros Kosiaris: Revert "hiera: update ores to pass statsd through statsd_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/548704 [10:16:43] (03CR) 10Alexandros Kosiaris: [C: 
03+2] Revert "hiera: update ores to pass statsd through statsd_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/548704 (owner: 10Alexandros Kosiaris) [10:16:51] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "hiera: update ores to pass statsd through statsd_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/548704 (owner: 10Alexandros Kosiaris) [10:17:04] (03PS1) 10Alexandros Kosiaris: Revert "Revert "hiera: update ores to pass statsd through statsd_exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/548705 [10:17:32] hmm.. CI is kinda slow or is just me? [10:24:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [10:28:05] (03PS1) 10Jbond: profile::base: reorder class [puppet] - 10https://gerrit.wikimedia.org/r/548706 (https://phabricator.wikimedia.org/T237259) [10:28:36] PROBLEM - traffic_server backend process restarted on cp5012 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5012&var-layer=backend [10:29:46] known ^ [10:29:52] RECOVERY - traffic_server backend process restarted on cp5012 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5012&var-layer=backend [10:29:54] permissions stuff again? 
:) [10:30:37] yup [10:31:43] (03PS2) 10Jbond: profile::base: reorder class [puppet] - 10https://gerrit.wikimedia.org/r/548706 (https://phabricator.wikimedia.org/T237259) [10:39:21] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [10:39:50] 10Operations, 10DBA, 10serviceops, 10Patch-For-Review: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10jcrespo) 05Open→03Resolved [10:39:55] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) [10:48:17] 10Operations: Important nagios-nrpe-server errors not showing up in unit journal - https://phabricator.wikimedia.org/T237236 (10MoritzMuehlenhoff) p:05Triage→03Normal [10:55:01] (03PS1) 10Elukey: Add Bacula backups for Analytics Meta's mylvm snapshots [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) [10:55:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 9:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547505 (https://phabricator.wikimedia.org/T228854) (owner: 10Jbond) [10:56:27] (03PS1) 10Vgutierrez: prometheus: Aggregate ats response code metrics [puppet] - 10https://gerrit.wikimedia.org/r/548712 (https://phabricator.wikimedia.org/T236482) [10:56:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/548264 (https://phabricator.wikimedia.org/T213089) (owner: 10Elukey) [10:56:36] (03PS2) 10Elukey: Add Bacula backups for Analytics Meta's mylvm snapshots [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) [10:56:50] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Aggregate ats response code metrics [puppet] - 10https://gerrit.wikimedia.org/r/548712 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) 
[10:56:58] :_( [10:57:14] (03CR) 10Elukey: [C: 03+2] role::prometheus::beta: add memcached metrics [puppet] - 10https://gerrit.wikimedia.org/r/548264 (https://phabricator.wikimedia.org/T213089) (owner: 10Elukey) [10:57:23] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/547519 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [10:57:28] (03PS2) 10Vgutierrez: prometheus: Aggregate ats response code metrics [puppet] - 10https://gerrit.wikimedia.org/r/548712 (https://phabricator.wikimedia.org/T236482) [10:57:49] elukey: merging your change too [10:57:55] <3 [10:59:02] !log pool cp5012 with ATS backend T227432 [10:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:07] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:00:46] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [11:01:44] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:03:22] (03CR) 10Filippo Giunchedi: mtail: add logstash program (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548280 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [11:04:00] (03CR) 10Ema: [C: 03+1] prometheus: Aggregate ats response code metrics [puppet] - 10https://gerrit.wikimedia.org/r/548712 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [11:04:25] (03CR) 10Jcrespo: [C: 03+1] "+1 for the bacula part" [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [11:08:27] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Aggregate ats response code metrics [puppet] - 10https://gerrit.wikimedia.org/r/548712 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [11:11:00] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10fgiunchedi) p:05High→03Normal [11:12:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add Bacula backups for Analytics Meta's mylvm snapshots (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [11:15:21] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10fgiunchedi) 05Open→03Stalled Stalling since I don't think we've seen a reoccurence yet. Now swiftrepl runs as a timer+service once a week, overlapping ru... 
[11:16:18] (03CR) 10Elukey: Add Bacula backups for Analytics Meta's mylvm snapshots (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [11:17:15] (03PS3) 10Elukey: Add Bacula backups for Analytics Meta's mylvm snapshots [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) [11:18:28] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:20:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add Bacula backups for Analytics Meta's mylvm snapshots [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [11:21:53] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10fgiunchedi) Looks like this is working now! cc @Krinkle https://logstash.wikimedia.org/app/kib... 
[11:22:03] (03PS1) 10Jcrespo: Add bakcup to typos [puppet] - 10https://gerrit.wikimedia.org/r/548714 [11:22:10] ^akosiaris: ottomata: [11:22:33] lol [11:22:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add bakcup to typos [puppet] - 10https://gerrit.wikimedia.org/r/548714 (owner: 10Jcrespo) [11:25:39] others I can think of I make bakula, backula, graphana [11:26:13] but I don't see any on repo, so not ongoing issue [11:34:23] (03CR) 10Jbond: "I'm not sure you caught the error conditions correctly but i may be missing something" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [11:38:02] (03CR) 10Jbond: "snipe alert ;): for additional points i thought we should have been able to use /proc/locks or lslocks instead of fuser however the PID st" [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [11:41:08] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201911), 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) [11:44:43] (03CR) 10Arturo Borrero Gonzalez: new k8s: adjust things to be compatible with migration to the new cluster (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [11:47:21] (03PS1) 10Jon Harald Søby: Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) [11:50:23] (03CR) 10Jbond: [C: 03+2] puppet git: add a descriptive config version [puppet] - 10https://gerrit.wikimedia.org/r/547505 (https://phabricator.wikimedia.org/T228854) (owner: 10Jbond) [11:50:25] (03PS1) 10Jon Harald Søby: Add Sakizaya (szy) language [dns] - 10https://gerrit.wikimedia.org/r/548718 (https://phabricator.wikimedia.org/T237369) [11:50:36] (03CR) 10Jbond: [C: 03+2] motd: add the config version to the MOTD [puppet] - 10https://gerrit.wikimedia.org/r/547506 
(https://phabricator.wikimedia.org/T228854) (owner: 10Jbond) [11:52:04] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:52:13] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/548718 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [11:54:32] (03CR) 10Urbanecm: [C: 03+1] "overall, looks good, thanks! Small issue noted inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [11:56:11] (03CR) 10Urbanecm: [C: 04-1] "02:04:10 FILE: wmf-config/CommonSettings.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [11:57:27] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201911), 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) [11:57:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:57:51] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:58:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:58:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:21] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:31] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:58:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:37] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201911), 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) a:05zeljkofilipin→03None [11:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:45] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:58:45] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:42] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10Jclark-ctr) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T1200). [12:00:04] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10Jclark-ctr) 05Open→03Resolved [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:06] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Jclark-ctr) [12:01:00] (03PS2) 10Jon Harald Søby: Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) [12:01:09] nothing to SWAT, it looks like [12:01:18] (03PS1) 10Jbond: motd: correct awk format [puppet] - 10https://gerrit.wikimedia.org/r/548720 (https://phabricator.wikimedia.org/T228854) [12:01:47] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) starting pdu refresh . [12:03:50] (03CR) 10Jbond: [C: 03+2] motd: correct awk format [puppet] - 10https://gerrit.wikimedia.org/r/548720 (https://phabricator.wikimedia.org/T228854) (owner: 10Jbond) [12:04:29] Lucas_WMDE: yup [12:08:26] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, 10User-jbond: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10jbond) 05Open→03Resolved a:03jbond Think this is complete now [12:11:10] (03PS1) 10Mobrovac: RESTRouter: Add szywiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/548724 (https://phabricator.wikimedia.org/T237374) [12:11:51] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add szywiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/548724 (https://phabricator.wikimedia.org/T237374) (owner: 10Mobrovac) [12:15:31] (03PS1) 10Mathew.onipe: prometheus: migrate wmf prometheus elasticsearch exporter to python3 [puppet] - 10https://gerrit.wikimedia.org/r/548725 [12:27:19] (03PS1) 10KartikMistry: WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [12:28:15] (03CR) 10jerkins-bot: [V: 04-1] WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [12:32:38] (03PS2) 10Gehel: prometheus: migrate wmf prometheus elasticsearch exporter to python3 [puppet] - 10https://gerrit.wikimedia.org/r/548725 (owner: 10Mathew.onipe) [12:32:43] (03PS2) 10KartikMistry: WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [12:33:27] (03CR) 10jerkins-bot: [V: 04-1] WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [12:34:38] (03CR) 10Gehel: [C: 03+2] prometheus: migrate wmf prometheus elasticsearch exporter to python3 [puppet] - 10https://gerrit.wikimedia.org/r/548725 (owner: 10Mathew.onipe) [12:39:48] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) >>! In T233661#5634559, @elukey wrote: >>>! In T233661#5632172, @BBlack wrote: >> Agreed, let's not go down that road right here... [12:41:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) >>! In T233661#5633768, @Nuria wrote: > @BBlack: once we deploy the VCL/varnish-kafka chnages we need to change our refine pipel... 
[12:52:03] (03PS3) 10KartikMistry: WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [12:52:46] (03CR) 10jerkins-bot: [V: 04-1] WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [12:57:02] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) I attempted a [[ https://github.com/puppetlabs/puppet/pull/7802 | patch for this upstream ]] although its not quite working yet [13:00:07] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [13:04:05] (03PS4) 10KartikMistry: WIP: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [13:10:36] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:17:37] (03PS1) 10Majavah: Add 104 (Cookbook) to $wgContentNamespaces for bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548738 (https://phabricator.wikimedia.org/T236840) [13:21:10] (03PS1) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [13:21:12] (03PS1) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [13:22:09] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) finished pdu refresh [13:23:06] (03CR) 10jerkins-bot: [V: 04-1] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [13:23:38] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) [13:24:41] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) a:05Cmjohnson→03RobH ` # pmshell 1: ps1-a1-eqiad 2: ps1-a2-eqiad 3: ps1-a3-eqiad 4: ps1-a4-eqiad 5: ps1-a5-eqiad 6: ps1-a6-eqiad 7: ps1-a7-eqi... 
[13:29:09] (03PS2) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [13:29:11] (03PS2) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [13:31:03] (03CR) 10jerkins-bot: [V: 04-1] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [13:32:45] (03CR) 10Andrew Bogott: "seems ok to me in theory, although I'm a bit concerned with the puppet compiler output:" [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [13:37:58] (03PS3) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [13:38:00] (03PS3) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [13:39:48] (03CR) 10jerkins-bot: [V: 04-1] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [13:43:18] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: move thirdparty/kubeadm-k8s packages from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/548741 [13:45:56] (03PS3) 10CDanis: puppet-merge: show who is holding the lock [puppet] - 10https://gerrit.wikimedia.org/r/548478 [13:49:48] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:52:40] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/19247/" [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [13:53:31] (03PS4) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [13:53:33] (03PS4) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [13:55:18] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:55:24] (03CR) 10jerkins-bot: [V: 04-1] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [13:55:33] (03CR) 10CDanis: puppet-merge: show who is holding the lock (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [13:58:14] (03CR) 10Elukey: "Mforns? :)" [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [14:00:31] (03CR) 10CDanis: [C: 03+1] swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [14:00:58] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:01:46] (03PS1) 10Ema: motd: avoid printing double quotes and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/548743 (https://phabricator.wikimedia.org/T228854) [14:02:59] (03PS5) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [14:03:01] (03PS5) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [14:03:28] (03CR) 10Elukey: "Looks good to me, a couple of questions:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [14:03:50] (03CR) 10Elukey: [C: 03+1] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/548488 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [14:04:54] (03CR) 10jerkins-bot: [V: 04-1] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [14:05:32] 10Operations, 10observability, 10serviceops: basic prometheus monitoring for PoolCounter - https://phabricator.wikimedia.org/T237407 (10CDanis) [14:09:38] (03PS6) 10Filippo Giunchedi: swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) [14:09:40] (03PS6) 10Filippo Giunchedi: swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) [14:16:31] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/548743 (https://phabricator.wikimedia.org/T228854) (owner: 10Ema) [14:16:35] (03CR) 10Jbond: [C: 03+1] motd: avoid printing double quotes and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/548743 (https://phabricator.wikimedia.org/T228854) 
(owner: 10Ema) [14:17:28] (03PS1) 10Ema: ATS: remap stream.wm.org websocket requests [puppet] - 10https://gerrit.wikimedia.org/r/548747 (https://phabricator.wikimedia.org/T227432) [14:17:42] (03CR) 10Filippo Giunchedi: [C: 03+2] swiftrepl: add ensure [puppet] - 10https://gerrit.wikimedia.org/r/548739 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [14:17:52] (03CR) 10Filippo Giunchedi: [C: 03+2] swiftrepl: disable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/548740 (https://phabricator.wikimedia.org/T231110) (owner: 10Filippo Giunchedi) [14:18:07] (03CR) 10Ema: [C: 03+2] motd: avoid printing double quotes and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/548743 (https://phabricator.wikimedia.org/T228854) (owner: 10Ema) [14:18:35] (03CR) 10Ema: [C: 03+2] ATS: remap stream.wm.org websocket requests [puppet] - 10https://gerrit.wikimedia.org/r/548747 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:20:57] (03CR) 10Ottomata: "Hm, are we doing this so we can do Kerberos sooner rather than later? I had assumed we'd just wait for the db1108 backups to start workin" [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [14:22:57] (03CR) 10Elukey: "> Hm, are we doing this so we can do Kerberos sooner rather than" [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [14:25:10] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Added metrics to http://beta-prometheus.wmflabs.org about memcached: http://beta-prometheus.wmflabs.org/beta/graph?g0... 
[14:29:22] (03PS1) 10DCausse: [cirrus] Disable Glent M0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) [14:29:24] (03PS1) 10DCausse: [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) [14:30:20] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Disable Glent M0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: 10DCausse) [14:30:37] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: 10DCausse) [14:32:19] (03PS3) 10Ottomata: Include hdfs_cleaner an an-coord node to clean HDFS /tmp dir [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) [14:35:35] (03PS15) 10CDanis: rsync: add option to TLS-wrap communications [puppet] - 10https://gerrit.wikimedia.org/r/547527 [14:35:37] (03CR) 10CDanis: rsync: add option to TLS-wrap communications (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547527 (owner: 10CDanis) [14:36:12] (03PS2) 10DCausse: [cirrus] Disable Glent M0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) [14:36:14] (03PS2) 10DCausse: [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) [14:37:15] !log reducing consistency temporarily on db1114 so it can catch up replication [14:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:16] (03PS2) 10RLazarus: httpbb: Create a new Puppet module for httpbb. 
[puppet] - 10https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) [14:44:03] (03CR) 10RLazarus: httpbb: Create a new Puppet module for httpbb. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [14:45:26] (03PS3) 10RLazarus: httpbb: Create a new Puppet module for httpbb. [puppet] - 10https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) [14:50:41] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:50:49] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:51:11] (03CR) 10Elukey: [C: 03+1] "Previous questions are not a blocker, if you want to proceed feel free to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:03:53] (03CR) 10Ottomata: "> 1) Do we want to test it on Hadoop test first, just to be sure?" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:04:12] (03PS1) 10Elukey: Add TLS certificates to Hadoop Analytics master nodes [puppet] - 10https://gerrit.wikimedia.org/r/548759 (https://phabricator.wikimedia.org/T237269) [15:04:56] (03CR) 10Faidon Liambotis: [C: 04-1] Initial forwarding-options templating (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547586 (owner: 10Ayounsi) [15:06:48] (03CR) 10Elukey: [C: 03+1] ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:07:51] (03CR) 10Elukey: [C: 03+2] Add TLS certificates to Hadoop Analytics master nodes [puppet] - 10https://gerrit.wikimedia.org/r/548759 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [15:16:13] 10Operations, 10Icinga, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10faidon) Thanks @herron! Should we resolve this? [15:17:51] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:18:52] (03PS1) 10Ottomata: eventgate-analytics - stream config for new stream wdqs.sparql-query [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) [15:19:22] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Patch-For-Review: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Mholloway) 05Open→03Resolved a:03Mholloway The replacement of t... 
[15:20:08] !log cp4027: upgrade trafficserver to 8.0.5-1wm10 [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10elukey) >>! In T233661#5635537, @BBlack wrote: >>>! In T233661#5634559, @elukey wrote: >>>>! In T233661#5632172, @BBlack wrote: >>> Agree... [15:25:45] 10Operations: stunnel-wrap all rsync::server usage - https://phabricator.wikimedia.org/T237424 (10CDanis) [15:26:18] (03PS16) 10CDanis: rsync: add option to TLS-wrap communications [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) [15:28:49] 10Operations, 10Icinga, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10herron) 05Open→03Resolved a:03herron Yes, I think we're in good shape here [15:34:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [15:35:37] 10Operations, 10Traffic: ats-tls-restart failed on cp4027 - https://phabricator.wikimedia.org/T237425 (10ema) [15:36:53] (03PS5) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [15:36:55] (03PS1) 10Jbond: wmflib: add puppet_settings facts [puppet] - 10https://gerrit.wikimedia.org/r/548767 [15:37:31] 10Operations, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10aborrero) [15:37:33] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10aborrero) 05Resolved→03Open Reopening task per @faidon suggestion, since there are a few leftover bits to be resolved: * patch https://gerrit.wikimedia.org/r/c/ope... [15:37:34] (03PS2) 10Ottomata: eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) [15:37:37] !log deploying schema change on x1 T234955 [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:42] T234955: API for reverts for SE v3 - https://phabricator.wikimedia.org/T234955 [15:39:27] (03PS3) 10Ottomata: eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) [15:40:18] (03CR) 10DCausse: eventgate-analytics - stream config for new sparql-query streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [15:41:16] (03CR) 10DCausse: [C: 03+1] eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 
10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [15:46:21] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [15:46:32] (03CR) 10Ottomata: "OK! I have no items in my .Trash. Makes sense!" [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:47:06] (03CR) 10Ottomata: [C: 03+1] "Ok!" [puppet] - 10https://gerrit.wikimedia.org/r/548711 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [15:47:29] (03CR) 10Jbond: [C: 03+2] wmflib: add puppet_settings facts [puppet] - 10https://gerrit.wikimedia.org/r/548767 (owner: 10Jbond) [15:50:15] (03CR) 10Jcrespo: [C: 03+1] "I confirm the script at https://gerrit.wikimedia.org/r/c/operations/puppet/+/547283 is working (I have tested live both failing and succee" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [15:50:18] (03PS1) 10RobH: ps1-b7-eqiad model setting [puppet] - 10https://gerrit.wikimedia.org/r/548769 (https://phabricator.wikimedia.org/T227542) [15:51:27] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:52:24] 10Operations, 10Wikimedia-Mailing-lists: Close Knowledge Integrity mailing list - https://phabricator.wikimedia.org/T237427 (10Samwalton9) [15:53:17] (03CR) 10RobH: [C: 03+2] ps1-b7-eqiad model setting [puppet] - 10https://gerrit.wikimedia.org/r/548769 (https://phabricator.wikimedia.org/T227542) (owner: 10RobH) [15:55:57] (03CR) 10Alexandros Kosiaris: ">This is in response to tools crashing nodes and the copytruncate setting is being used. I manually truncated the logs to fix things. 
I'm " [puppet] - 10https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) (owner: 10Bstorm) [15:56:03] (03CR) 10Ayounsi: [C: 03+2] Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [15:56:11] (03PS4) 10Ayounsi: Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 [15:57:02] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [15:58:34] (03PS5) 10Elukey: report updater job: produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [16:06:20] (03PS6) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [16:07:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/548269 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [16:07:25] (03CR) 10jerkins-bot: [V: 04-1] puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [16:08:07] (03CR) 10Elukey: [C: 03+2] report updater job: produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [16:09:48] (03PS4) 10CDanis: puppet-merge: show who is holding the lock [puppet] - 10https://gerrit.wikimedia.org/r/548478 [16:11:02] (03PS5) 10CDanis: puppet-merge: show who is holding the lock [puppet] - 10https://gerrit.wikimedia.org/r/548478 [16:11:53] (03CR) 10CDanis: "Thanks! These were all very good questions -- and as written it was definitely confusing. 
I added some comments for posterity" [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [16:12:19] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10elukey) >>! In T220387#5093827, @elukey wrote: > One thing that we didn't discuss for this goal is Zookeeper. At the moment multi... [16:12:24] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) Plan for migration: - base install of buster on new dumpsdata server with role::spare - make a mount point for thedata filesystem, install rsync, copy over data from labsto... [16:12:48] (03PS6) 10CDanis: puppet-merge: show who is holding the lock [puppet] - 10https://gerrit.wikimedia.org/r/548478 [16:12:54] (03CR) 10CDanis: [C: 03+2] puppet-merge: show who is holding the lock [puppet] - 10https://gerrit.wikimedia.org/r/548478 (owner: 10CDanis) [16:15:28] 10Operations, 10Wikispeech-Text-to-Speech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072 (10Sebastian_Berlin-WMSE) [16:20:26] (03PS2) 10Dmaza: Enable SpecialMute page on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) [16:24:27] !log Replacing disk on db2120 [16:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:15] (03PS1) 10Ayounsi: Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 [16:29:23] (03CR) 10Ottomata: [C: 03+2] Include hdfs_cleaner an an-coord node to clean HDFS /tmp dir [puppet] - 10https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [16:30:20] (03CR) 10jerkins-bot: [V: 04-1] Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) 
[16:30:42] (03CR) 10CDanis: Icinga, fix mgmt_parents hiera call for cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) [16:31:42] (03PS7) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [16:31:44] (03PS1) 10Jbond: puppetmaster: update spec tests to use shared helper [puppet] - 10https://gerrit.wikimedia.org/r/548782 [16:31:53] (03PS2) 10Ayounsi: Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 [16:32:48] (03CR) 10CDanis: [C: 03+1] Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) [16:32:50] (03CR) 10Ayounsi: Icinga, fix mgmt_parents hiera call for cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) [16:33:31] (03CR) 10Jbond: [C: 03+2] puppetmaster: update spec tests to use shared helper [puppet] - 10https://gerrit.wikimedia.org/r/548782 (owner: 10Jbond) [16:33:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) [16:35:09] (03CR) 10Ayounsi: [C: 03+2] Icinga, fix mgmt_parents hiera call for cloud [puppet] - 10https://gerrit.wikimedia.org/r/548781 (owner: 10Ayounsi) [16:36:09] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10Papaul) a:05Papaul→03jcrespo Disk replaced [16:38:36] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10jcrespo) I see it rebuilding, will update when it finishes to confirm it was ok. [16:39:29] (03CR) 10Andrew Bogott: "Not sure if this is a pcc artifact, but... 
https://puppet-compiler.wmflabs.org/compiler1001/19253/cloud-puppetmaster-01.cloudinfra.eqiad.w" [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [16:39:47] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) SRS.CiscoDesk Hello. Your pickup is scheduled for Thursday, 11.07.19. The carrier will be Tex Air Delivery. (Driver’s Name will be provided day of pickup Attached is your Bill of Lading and Sh... [16:42:34] 10Operations, 10Wikimedia-Mailing-lists, 10Mobile: Pipermail on lists.wikimedia.org is not mobile friendly - https://phabricator.wikimedia.org/T190054 (10MarcoAurelio) I think Mailman v3 interfaces do solve this issue. [16:42:43] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [16:45:09] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10RobH) [16:45:31] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Papaul) [16:45:34] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10RobH) a:05RobH→03Jclark-ctr [] - clear icinga errors for missing ps2 input by connecting/checking connection of the rj11 cable connection between ps1 and ps2 b7-eqiad. Once i... 
[16:46:20] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10Papaul) Return information {F31011145} [16:47:34] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: move thirdparty/kubeadm-k8s packages from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/548741 [16:47:47] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10RobH) Once the towers are linked, the errors on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-b7-eqiad should clear up and go green for tower B. [16:48:07] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) @Dzahn: @Joe and I chatted and we decided to bump the memory limit by 100MB for the parsoid cluster. Can you prepare the patches for this? Thanks! [16:49:31] (03PS3) 10Ayounsi: Intial interfaces templates [homer/public] - 10https://gerrit.wikimedia.org/r/547584 [16:50:58] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Patch-For-Review: How should the MachineVision extension interact with external APIs from production? 
- https://phabricator.wikimedia.org/T236797 (10Mholloway) [16:50:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: move thirdparty/kubeadm-k8s packages from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/548741 (owner: 10Arturo Borrero Gonzalez) [16:51:17] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10Papaul) a:05Papaul→03None [16:51:54] (03PS8) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [16:56:31] !log deleted stretch-wikimedia/thirdparty/kubeadm-k8s and created buster-wikimedia/thirdparty/kubeadm-k8s [16:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:43] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Patch-For-Review: Reindex commonswiki as shards have grown beyond critical threshold - https://phabricator.wikimedia.org/T231446 (10Mathew.onipe) T230746 is almost here. So we won't be resharding any index for now [17:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:31] ACKNOWLEDGEMENT - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 91.46% of data under the critical threshold [5.0] Mathew.onipe https://phabricator.wikimedia.org/T237228 - The acknowledgement expires at: 2019-11-08 16:59:58. 
https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [17:07:00] (03PS1) 10RobH: adding 1tb sata sku [software] - 10https://gerrit.wikimedia.org/r/548804 [17:07:37] (03PS2) 10RobH: adding 2tb sata sku [software] - 10https://gerrit.wikimedia.org/r/548804 [17:08:30] (03CR) 10RobH: [C: 03+2] adding 2tb sata sku [software] - 10https://gerrit.wikimedia.org/r/548804 (owner: 10RobH) [17:09:02] (03Merged) 10jenkins-bot: adding 2tb sata sku [software] - 10https://gerrit.wikimedia.org/r/548804 (owner: 10RobH) [17:11:53] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:17:59] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [17:21:29] !log restarting etherpad [17:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:09] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8963 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [17:28:27] drop unused log.NavigationTiming tables from db110[7,8] - T233891 [17:28:28] T233891: Drop Navigationtiming data entirely from mysql storage? 
- https://phabricator.wikimedia.org/T233891 [17:35:17] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10RobH) p:05Triage→03High [17:35:36] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10RobH) [17:41:29] (03PS4) 10Ayounsi: Intial interfaces templates [homer/public] - 10https://gerrit.wikimedia.org/r/547584 [17:43:17] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10RobH) [17:43:19] 10Operations, 10Traffic, 10codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482 (10BBlack) 05Open→03Declined We're not going down this road at all. `cache::route_table` will just go away when all cache backends have converted to ATS in T227432, which doesn'... [17:47:11] (03PS1) 10Gergő Tisza: Use production RESTBase in beta GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548814 [17:49:11] 10Operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181 (10BBlack) 05Open→03Resolved a:03BBlack Yes, this task was long-ago completed. See also https://phabricator.wikimedia.org/phame/post/view/111/wikipedia_goes_100_forward_secret/ [17:50:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [17:52:34] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10BBlack) @MBeat33 / @Jseddon - Any update yet? 
[17:53:34] (03CR) 10CDanis: rsync: add option to TLS-wrap communications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [17:53:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC's a NOOP https://puppet-compiler.wmflabs.org/compiler1001/19257/" [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [17:56:07] 10Operations, 10Traffic: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 (10BBlack) 05Open→03Resolved a:03ayounsi [17:56:11] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10BBlack) [17:57:44] (03CR) 10Ayounsi: Initial forwarding-options templating (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547586 (owner: 10Ayounsi) [17:58:40] cdanis: yeah fail() on whitespace why not, better than known-broken [17:58:41] (03PS5) 10Ayounsi: Initial forwarding-options templating [homer/public] - 10https://gerrit.wikimedia.org/r/547586 [17:58:56] 10Operations, 10Traffic: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525 (10BBlack) 05Open→03Declined I don't think we'll go the LVS route here. [17:58:59] 10Operations, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10BBlack) [17:59:04] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [17:59:42] cdanis: the patch is 100% good to go as is I think FWIW [17:59:49] godog: thanks! [17:59:54] I think the fail will be easy [18:00:04] cscott, arlolra, subbu, halfak, accraze, and mdholloway: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T1800). [18:00:12] No parsoid deploy today [18:00:33] cdanis: hahaha that's how puppet keeps us on our toes, easy things can be hard [18:00:53] RECOVERY - MegaRAID on db2120 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:06:33] 10Operations, 10Traffic, 10Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (10BBlack) [18:06:35] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10BBlack) [18:06:38] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [18:07:59] 10Operations, 10Traffic, 10Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (10BBlack) 05Open→03Resolved a:03BBlack With anycast recdns deployed at all sites with fallback routing towards the cores (or to the opposite core, as the case may be),... [18:08:06] 10Operations, 10Traffic: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498 (10BBlack) I think this is actually fairly orthogonal to some of the other improvements. Not sure what current/modern thinking is on this either, probably needs re-evaluation. My gut feeling it... 
[18:08:48] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [18:08:56] 10Operations, 10Traffic, 10Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (10BBlack) [18:08:58] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [18:12:08] (03PS17) 10CDanis: rsync: add option to TLS-wrap communications [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) [18:13:21] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10Dzahn) We should probably still remove the math packages (but not ploticus!) from all mw appservers. It was a bit confusing why some servers were different from others in: T237304 -> https... [18:14:45] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:15:26] (03PS18) 10CDanis: rsync: add option to TLS-wrap communications [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) [18:16:43] (03PS5) 10Ayounsi: Intial interfaces templates [homer/public] - 10https://gerrit.wikimedia.org/r/547584 [18:21:10] subbu: Can I claim this window for a deployment? [18:21:28] sure. no blockers from me. [18:23:18] Thanks! 
[18:23:36] (03PS3) 10Niharika29: Enable SpecialMute page on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [18:24:05] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [18:24:51] (03Merged) 10jenkins-bot: Enable SpecialMute page on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548568 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [18:27:07] dmaza: You can test ^ on mwdebug1001. [18:27:20] on it [18:28:12] (03CR) 10CDanis: [C: 03+2] "PCC run so far reveals only expected diffs (hiera addition for effective no-op) and failures are just hosts that also fail without the cha" [puppet] - 10https://gerrit.wikimedia.org/r/547527 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [18:30:24] !log fix typo on cr1-eqsin:lo0.0 v6 IP [18:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:28] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10jcrespo) ` Firmware state: Online, Spun Up State : Optimal === RaidStatus (does not include components in optimal state) === RaidStatus completed ` Everything turned up good, please proceed with... [18:32:54] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10jcrespo) a:05jcrespo→03Papaul [18:32:59] (03CR) 10Mholloway: [C: 04-1] "hold for scheduled deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547741 (https://phabricator.wikimedia.org/T236797) (owner: 10Mholloway) [18:35:33] Niharika: just a couple more minutes [18:35:53] dmaza: No rush. 
[18:37:49] (03PS1) 10Mholloway: WikimediaEditorTasks: Enable edit streaks in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548832 (https://phabricator.wikimedia.org/T234956) [18:39:56] (03PS1) 10Niharika29: Revert "Enable SpecialMute page on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548833 [18:40:17] (03CR) 10Niharika29: [C: 03+2] "Reverting patch (swat)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548833 (owner: 10Niharika29) [18:40:52] !log MediaWiki train: start branching wmf/1.35.0-wmf.5 [18:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:04] (03Merged) 10jenkins-bot: Revert "Enable SpecialMute page on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548833 (owner: 10Niharika29) [18:44:43] Niharika: Could you please let me know when you're all done? I've got a config change I'd like to deploy as well. [18:44:46] (03CR) 10Krinkle: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [18:44:55] mdholloway: I'm done. [18:45:00] Niharika: Great, thank you! [18:45:15] (03PS2) 10Mholloway: WikimediaEditorTasks: Enable edit streaks in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548832 (https://phabricator.wikimedia.org/T234956) [18:46:28] (03CR) 10Bstorm: "Validated that tools-manifest and webservice monitor use the frontend for gridengine only and have no interface to the kubernetesbackend.p" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [18:48:19] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:48:27] (03PS3) 10Mholloway: WikimediaEditorTasks: Enable streaks and revert counts in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548832 (https://phabricator.wikimedia.org/T234956) [18:49:03] (03PS4) 10Mholloway: WikimediaEditorTasks: Enable streaks and revert counts in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548832 (https://phabricator.wikimedia.org/T234955) [18:50:41] Actually, no config deploy for me, that's not quite ready to be deployed yet. [18:51:50] (03CR) 10Mholloway: [C: 04-1] "Hold until the wikimedia_editor_tasks_edit_streak table is created in x1/wikishared" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548832 (https://phabricator.wikimedia.org/T234955) (owner: 10Mholloway) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T1900) [19:04:20] (03CR) 10Ayounsi: "Now that we're making progress in https://gerrit.wikimedia.org/r/c/operations/homer/public/+/547584 and https://gerrit.wikimedia.org/r/c/o" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (https://phabricator.wikimedia.org/T223161) (owner: 10Ayounsi) [19:04:49] (03CR) 10Krinkle: [C: 04-1] "Indeed. 
If we are worried about JS payload on rare articles without any links, then that would be a sensible optimisation to make in Popup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547653 (https://phabricator.wikimedia.org/T235797) (owner: 10Jdlrobson) [19:05:37] (03PS1) 10Cmjohnson: Adding dhcp file and netboot cfg file for dumpsdata1003 [puppet] - 10https://gerrit.wikimedia.org/r/548840 (https://phabricator.wikimedia.org/T234076) [19:06:00] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) Cisco servers ready for pickup {F31012445} [19:08:55] @seen joal [19:08:55] mutante: Last time I saw joal they were quitting the network with reason: Ping timeout: 268 seconds N/A at 11/5/2019 6:56:40 PM (12m14s ago) [19:09:35] (03Abandoned) 10Jdlrobson: Do not load page previews on Special:BlankPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547653 (https://phabricator.wikimedia.org/T235797) (owner: 10Jdlrobson) [19:17:05] (03PS2) 10Gergő Tisza: Use production RESTBase in beta GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548814 [19:18:37] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:19:58] (03CR) 10Cmjohnson: [C: 03+2] Adding dhcp file and netboot cfg file for dumpsdata1003 [puppet] - 10https://gerrit.wikimedia.org/r/548840 (https://phabricator.wikimedia.org/T234076) (owner: 10Cmjohnson) [19:21:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Cmjohnson) [19:22:25] 10Operations, 10MediaWiki-Documentation, 10User-Dereckson, 10patch-welcome: Repar "svn.wikimedia.org/doc/ should redirect to doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Krinkle) [19:22:34] 10Operations, 10MediaWiki-Documentation, 10User-Dereckson, 10patch-welcome: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Krinkle) [19:27:33] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:27:56] 10Operations, 10Core Platform Team, 10MediaWiki-Documentation, 10User-Dereckson, 10patch-welcome: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Krinkle) [19:29:28] 10Operations, 10Core Platform Team, 10MediaWiki-Documentation, 10User-Dereckson, 10patch-welcome: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Krinkle) I've asked about this on IRC. It seems to only take a 1 line change in Puppet to fix the... [19:37:18] (03PS3) 10Andrew Bogott: toolforge: Remove direct TLS termination support from static-server [puppet] - 10https://gerrit.wikimedia.org/r/547363 (https://phabricator.wikimedia.org/T236952) (owner: 10Alex Monk) [19:38:38] (03CR) 10Ayounsi: "This change is ready for review." 
[homer/public] - https://gerrit.wikimedia.org/r/547584 (owner: Ayounsi)
[19:38:45] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:39:48] (CR) Andrew Bogott: [C: +2] toolforge: Remove direct TLS termination support from static-server [puppet] - https://gerrit.wikimedia.org/r/547363 (https://phabricator.wikimedia.org/T236952) (owner: Alex Monk)
[19:40:44] (CR) Kosta Harlan: [C: +1] Use production RESTBase in beta GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/548814 (owner: Gergő Tisza)
[19:41:37] (PS1) Ottomata: Use HDFS trash settings as default everywhere [puppet] - https://gerrit.wikimedia.org/r/548850 (https://phabricator.wikimedia.org/T235200)
[19:43:13] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) @BBlack the current proposal is: - On the netbox host(s) there will be a script to generate the snippet files that will perform some...
[19:43:22] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) p:Triage→Normal
[19:45:25] (PS3) Ayounsi: Initial netbox interfaces support [software/homer] - https://gerrit.wikimedia.org/r/547562
[19:46:01] (CR) Ayounsi: "Jenkins error seem unfounded."
[software/homer] - https://gerrit.wikimedia.org/r/547562 (owner: Ayounsi)
[19:47:56] (CR) jerkins-bot: [V: -1] Initial netbox interfaces support [software/homer] - https://gerrit.wikimedia.org/r/547562 (owner: Ayounsi)
[19:51:50] (PS1) Alex Monk: tools-static: Fix old cert absent line [puppet] - https://gerrit.wikimedia.org/r/548861
[19:52:16] (PS2) Alex Monk: tools-static: Fix old cert absent line [puppet] - https://gerrit.wikimedia.org/r/548861 (https://phabricator.wikimedia.org/T236952)
[19:53:48] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/19259/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/548850 (https://phabricator.wikimedia.org/T235200) (owner: Ottomata)
[19:55:06] (CR) Andrew Bogott: [C: +2] tools-static: Fix old cert absent line [puppet] - https://gerrit.wikimedia.org/r/548861 (https://phabricator.wikimedia.org/T236952) (owner: Alex Monk)
[19:59:53] (PS2) Alex Monk: toolforge: Remove old absented star.wmflabs.org certificate [puppet] - https://gerrit.wikimedia.org/r/547364 (https://phabricator.wikimedia.org/T236952)
[20:00:04] twentyafterfour and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - American version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T2000).
[20:00:08] !log joal@deploy1001 Started deploy [analytics/refinery@8013a86]: Analytics deploy for spark upgrade
[20:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:02] (PS1) Mholloway: MachineVision: Do not restrict to testing users in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/548868
[20:06:04] (PS1) Mholloway: Configure and enable MachineVision on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349)
[20:06:24] (CR) Mholloway: [C: -1] "Hold for deploy window" [mediawiki-config] - https://gerrit.wikimedia.org/r/548868 (owner: Mholloway)
[20:06:35] (CR) Mholloway: [C: -1] "Hold for deploy window" [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349) (owner: Mholloway)
[20:08:43] (PS2) Mholloway: Configure and enable MachineVision on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349)
[20:08:54] (CR) Mholloway: [C: -1] Configure and enable MachineVision on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349) (owner: Mholloway)
[20:08:57] !log joal@deploy1001 Finished deploy [analytics/refinery@8013a86]: Analytics deploy for spark upgrade (duration: 08m 49s)
[20:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:28] !log joal@deploy1001 Started deploy [analytics/refinery@ea631bd]: Analytics deploy for spark upgrade - forgotten patch
[20:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:20] (PS1) CDanis: rsync stunnel: allow both cleartext and encrypted service [puppet] - https://gerrit.wikimedia.org/r/548873 (https://phabricator.wikimedia.org/T237424)
[20:16:05] (PS2) Ottomata: Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade
[puppet] - https://gerrit.wikimedia.org/r/548488 (https://phabricator.wikimedia.org/T222253)
[20:17:25] (CR) CDanis: [C: +2] rsync stunnel: allow both cleartext and encrypted service [puppet] - https://gerrit.wikimedia.org/r/548873 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[20:17:49] !log joal@deploy1001 Finished deploy [analytics/refinery@ea631bd]: Analytics deploy for spark upgrade - forgotten patch (duration: 08m 21s)
[20:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:12] !log push fw policies to pfw3-codfw - T236201
[20:23:12] (PS4) RLazarus: httpbb: Create a new Puppet module for httpbb. [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699)
[20:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:43] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,POST,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:24:33] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,create,get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:24:33] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:24:39] (CR) RLazarus: "PTAL" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus)
[20:24:51] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:25:05] PROBLEM - k8s API
server requests latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:25:23] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:26:41] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:27:45] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:27:47] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:28:03] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:28:19] (PS3) Mholloway: Configure and enable MachineVision on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349)
[20:28:25] hello, is Gerrit having trouble? i'm getting issues trying to git pull
[20:28:31] RECOVERY - k8s API server requests latencies on argon is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:28:31] (CR) Mholloway: [C: -1] Configure and enable MachineVision on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548869 (https://phabricator.wikimedia.org/T227349) (owner: Mholloway)
[20:28:44] (CR) Ottomata: [C: +2] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade [puppet] - https://gerrit.wikimedia.org/r/548488 (https://phabricator.wikimedia.org/T222253) (owner: Ottomata)
[20:29:01] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:30:43] MatmaRex: not that I'm aware of? what repo is giving you trouble?
[20:31:29] in fact, i can't ping it, but it's still working in the browser. fascinating
[20:32:30] eh, i think it's on my side
[20:33:03] !log push fw policies to pfw3-eqiad - T236201
[20:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:18] MatmaRex: I just cloned the entire tree of mediawiki + deployed extensions and it didn't seem to have any issues
[20:35:07] for the record, these are the commands and errors i'm getting: https://phabricator.wikimedia.org/P9530
[20:35:12] but i think the problem is on my side, so please carry on :)
[20:36:21] MatmaRex: are you using persistent connections?
[20:36:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:52] Operations, ops-codfw, DC-Ops, decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (Papaul)
[20:37:58] MatmaRex: try doing just ssh -p29418 matmarex@gerrit.wikimedia.org
[20:38:08] you should get a message along the lines
[20:38:20] "Hi MatmaRex, you have successfully connected over SSH."
[20:38:27] Operations, ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (Papaul) Open→Resolved
[20:39:35] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:39:41] ssh: connect to host gerrit.wikimedia.org port 29418: Connection timed out
[20:40:00] Operations, Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (Dzahn) Resolved→Open
[20:43:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[20:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[20:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:43] Operations, serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `cobalt.wikimedia.org` - cobalt.wikimedia.org (**PASS**) - Downtimed host on Icinga - Downtimed management interface on Icinga...
[20:45:11] !log shutting down cobalt (formerly gerrit server)
[20:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:38] Operations, ops-eqiad: Degraded RAID on cobalt - https://phabricator.wikimedia.org/T237457 (ops-monitoring-bot)
[20:46:52] ^ lol, wut
[20:47:01] 10 seconds after decom the RAID breaks?
[20:47:09] or it thinks it did
[20:47:29] (PS1) CDanis: rsync stunnel: enable for netmon1002 & /var/lib/smokeping [puppet] - https://gerrit.wikimedia.org/r/548877 (https://phabricator.wikimedia.org/T237424)
[20:50:49] (PS1) Jhedden: install_server: add cloudcephmon servers [puppet] - https://gerrit.wikimedia.org/r/548878 (https://phabricator.wikimedia.org/T228102)
[20:51:14] Operations, ops-eqiad: Degraded RAID on cobalt - https://phabricator.wikimedia.org/T237457 (Dzahn) Open→Invalid This happened about 10 seconds after running the decom cookbook on T236187. Server was just down..
[20:52:32] Operations, serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (Dzahn)
[20:53:07] (CR) Awight: "Thank you!"
[puppet] - https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: Awight)
[20:53:32] (PS1) Herron: logstash: add version param and exclude plugins when non 5.x [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[20:54:37] (PS1) Dzahn: site: remove node cobalt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/548881 (https://phabricator.wikimedia.org/T236187)
[20:54:47] (PS2) Herron: logstash: add version param and exclude plugins when non 5.x [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[20:55:39] (PS2) Dzahn: site: remove node cobalt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/548881 (https://phabricator.wikimedia.org/T236187)
[20:55:41] (PS1) Papaul: DNS: Remove mgmt DNS for dbstore200[1-2] [dns] - https://gerrit.wikimedia.org/r/548882
[20:56:31] (CR) Dzahn: [V: +1 C: +1] DNS: Remove mgmt DNS for dbstore200[1-2] [dns] - https://gerrit.wikimedia.org/r/548882 (owner: Papaul)
[20:57:24] (CR) Paladox: [C: +1] site: remove node cobalt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/548881 (https://phabricator.wikimedia.org/T236187) (owner: Dzahn)
[20:57:34] (CR) Dzahn: [C: +2] site: remove node cobalt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/548881 (https://phabricator.wikimedia.org/T236187) (owner: Dzahn)
[20:58:04] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for dbstore200[1-2] [dns] - https://gerrit.wikimedia.org/r/548882 (owner: Papaul)
[20:58:38] (PS2) Dzahn: remove production IPs for cobalt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/547650 (https://phabricator.wikimedia.org/T236187)
[20:59:30] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (Papaul)
[20:59:42] Operations,
ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (Papaul) complete
[21:00:04] mdholloway: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Enable MachineVision on testcommonswiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191105T2100).
[21:00:50] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (Papaul) Open→Resolved
[21:03:20] (PS2) CDanis: rsync stunnel: enable for netmon1002 & /var/lib/smokeping [puppet] - https://gerrit.wikimedia.org/r/548877 (https://phabricator.wikimedia.org/T237424)
[21:03:22] (PS1) CDanis: rsync stunnel: ferm support for both clear & TLS [puppet] - https://gerrit.wikimedia.org/r/548884 (https://phabricator.wikimedia.org/T237424)
[21:08:12] (PS1) Mholloway: MachineVision: Update filtered_tables.txt [puppet] - https://gerrit.wikimedia.org/r/548886 (https://phabricator.wikimedia.org/T227349)
[21:09:53] Operations, ops-eqiad, decommission, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-4/0/6 { - description graphite1001; - } ``
[21:10:39] Operations, ops-eqiad, decommission, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (Papaul)
[21:14:24] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:14:45] (PS1) Papaul: DNS: Remove mgmt DNS with graphite100[13] [dns] - https://gerrit.wikimedia.org/r/548888
[21:14:53] (PS3) CDanis: rsync stunnel: enable for netmon1002 & /var/lib/smokeping [puppet] - https://gerrit.wikimedia.org/r/548877 (https://phabricator.wikimedia.org/T237424)
[21:14:55] (PS1) CDanis: rsync stunnel: the client needs stunnel4 installed too [puppet] - https://gerrit.wikimedia.org/r/548889 (https://phabricator.wikimedia.org/T237424)
[21:17:01] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS with graphite100[13] [dns] - https://gerrit.wikimedia.org/r/548888 (owner: Papaul)
[21:18:29] (CR) CDanis: [C: +2] rsync stunnel: ferm support for both clear & TLS [puppet] - https://gerrit.wikimedia.org/r/548884 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[21:18:35] (CR) CDanis: [C: +2] rsync stunnel: the client needs stunnel4 installed too [puppet] - https://gerrit.wikimedia.org/r/548889 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[21:18:45] (CR) CDanis: [C: +2] rsync stunnel: enable for netmon1002 & /var/lib/smokeping [puppet] - https://gerrit.wikimedia.org/r/548877 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[21:18:52] Operations, ops-eqiad, decommission, Patch-For-Review, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (Papaul)
[21:19:10] Operations, ops-eqiad, decommission, Patch-For-Review, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (Papaul) Open→Resolved complete
[21:21:18] Operations, ops-eqiad, Traffic, decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-4/0/25 { - description radon;
- }
[21:21:29] (PS1) CDanis: rsync stunnel: fix typo [puppet] - https://gerrit.wikimedia.org/r/548893
[21:21:38] (CR) CDanis: [C: +2] rsync stunnel: fix typo [puppet] - https://gerrit.wikimedia.org/r/548893 (owner: CDanis)
[21:21:57] Operations, ops-eqiad, Traffic, decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul)
[21:24:09] (PS1) Papaul: DNS: Remove mgmt DNS for radon [dns] - https://gerrit.wikimedia.org/r/548896
[21:25:09] (PS3) Dzahn: cumin: add missing alias to cover labtestpuppetmaster2001 [puppet] - https://gerrit.wikimedia.org/r/548539 (https://phabricator.wikimedia.org/T222061)
[21:26:05] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for radon [dns] - https://gerrit.wikimedia.org/r/548896 (owner: Papaul)
[21:27:12] Operations, ops-eqiad, Traffic, decommission, Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul)
[21:27:49] Operations, ops-eqiad, Traffic, decommission, Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (Papaul) Open→Resolved Complete
[21:30:46] (PS1) CDanis: rsync stunnel: need to enable in /etc/defaults, apparently [puppet] - https://gerrit.wikimedia.org/r/548898 (https://phabricator.wikimedia.org/T237424)
[21:32:51] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces interface-range vlan-private1-c-eqiad] - member ge-5/0/39; [edit interfaces interface-range disabled] m...
[21:33:45] (CR) CDanis: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/19265/netmon1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/548898 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[21:34:15] (CR) Ottomata: "Hm, apparently this is not necessary like I thought it was:" [puppet] - https://gerrit.wikimedia.org/r/548850 (https://phabricator.wikimedia.org/T235200) (owner: Ottomata)
[21:36:30] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Papaul) ` papaul@asw2-d-eqiad# show | compare [edit interfaces] - ge-1/0/15 { - description db1068; - }
[21:36:46] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Papaul)
[21:37:57] (PS1) CDanis: rsync stunnel: typo fix [puppet] - https://gerrit.wikimedia.org/r/548901
[21:38:14] (CR) CDanis: [V: +2 C: +2] rsync stunnel: typo fix [puppet] - https://gerrit.wikimedia.org/r/548901 (owner: CDanis)
[21:38:23] Operations, ops-eqiad, decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (Papaul)
[21:38:51] Operations, ops-eqiad, DC-Ops, decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (Papaul)
[21:42:13] Operations, ops-eqiad, DC-Ops, decommission: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (Papaul) ` papaul@asw2-d-eqiad# show | compare [edit interfaces interface-range vlan-private1-d-eqiad] - member ge-1/0/17; [edit interfaces interface-range disabled] m...
[21:44:50] Operations, ops-eqiad, DC-Ops, decommission: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (Papaul)
[21:47:00] Operations, ops-eqiad, DC-Ops, decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Papaul) ` papaul@asw2-d-eqiad# show | compare [edit interfaces interface-range vlan-private1-d-eqiad] - member ge-1/0/18; [edit interfaces interface-range disabled] m...
[21:47:35] Operations, ops-eqiad, DC-Ops, decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Papaul)
[21:55:27] ok finally deploying the train to testwikis
[21:55:41] (PS1) 20after4: testwikis wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - https://gerrit.wikimedia.org/r/548904
[21:55:43] (CR) 20after4: [C: +2] testwikis wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - https://gerrit.wikimedia.org/r/548904 (owner: 20after4)
[21:56:23] i'll reschedule my current deploy window for tomorrow morning (EST) since i think i need https://gerrit.wikimedia.org/r/548886 merged before i can proceed
[21:56:32] (Merged) jenkins-bot: testwikis wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - https://gerrit.wikimedia.org/r/548904 (owner: 20after4)
[21:58:26] !log remove 127.0.0.1/32 and ::1/128 from cr3-esams:lo0.0
[21:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:50] (PS1) Papaul: DNS: Remove mgmt DNS for db1063,db1064,db1065,db1068,db1070 and db1071 [dns] - https://gerrit.wikimedia.org/r/548907
[22:00:18] (PS1) CDanis: rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908 (https://phabricator.wikimedia.org/T237424)
[22:00:55] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for db1063,db1064,db1065,db1068,db1070 and db1071 [dns] - https://gerrit.wikimedia.org/r/548907 (owner: Papaul)
[22:02:04] Operations, ops-eqiad, DC-Ops,
decommission, Patch-For-Review: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Papaul)
[22:02:23] (CR) jerkins-bot: [V: -1] rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[22:02:35] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Papaul) Open→Resolved Complete
[22:02:37] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Papaul)
[22:03:14] Operations, ops-eqiad, decommission, Patch-For-Review: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (Papaul)
[22:03:36] Operations, ops-eqiad, decommission, Patch-For-Review: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (Papaul) Open→Resolved Complete
[22:03:39] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Papaul)
[22:03:49] !log remove 127.0.0.1/32 and ::1/128 from cr2-esams:lo0.0
[22:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:36] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1065 - https://phabricator.wikimedia.org/T227560 (Papaul)
[22:04:38] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.5 refs T233853
[22:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:43] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853
[22:04:48] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1065 - https://phabricator.wikimedia.org/T227560 (Papaul) Open→Resolved Complete
[22:05:31] (PS2) CDanis: rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908
(https://phabricator.wikimedia.org/T237424)
[22:05:44] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Papaul)
[22:06:06] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Papaul)
[22:06:09] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Papaul) Open→Resolved Complete
[22:07:03] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (Papaul)
[22:07:13] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Papaul)
[22:07:15] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (Papaul) Open→Resolved Complete
[22:07:27] (CR) jerkins-bot: [V: -1] rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[22:08:01] (PS3) CDanis: rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908 (https://phabricator.wikimedia.org/T237424)
[22:08:56] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Papaul)
[22:09:07] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Papaul) Open→Resolved Complete
[22:09:09] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Papaul)
[22:09:32] !log twentyafterfour@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki"
--outdir="/tmp/scap_l10n_840646293" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 04m 54s)
[22:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:34] (CR) CDanis: [C: +2] rsync stunnel: always accept from localhost [puppet] - https://gerrit.wikimedia.org/r/548908 (https://phabricator.wikimedia.org/T237424) (owner: CDanis)
[22:13:48] Operations, Patch-For-Review: stunnel-wrap all rsync::server usage - https://phabricator.wikimedia.org/T237424 (CDanis) After far too many things that were subtly wrong, it works! {P9532}
[22:15:38] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Jclark-ctr) @jhedden . Found host did not have ip address in it. Reentered address and mgnt password
[22:17:43] !log scap failed with error: A copy of your installation's LocalSettings.php must exist and be readable in the source directory. Use --conf to specify it. refs T233853
[22:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:51] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853
[22:32:39] (PS1) Jforrester: Explicitly set wgServer for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548917
[22:33:07] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden) @Jclark-ctr thanks! confirmed that it's working from my end now.
[22:35:29] (CR) 20after4: [C: +2] Explicitly set wgServer for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548917 (owner: Jforrester)
[22:36:20] (Merged) jenkins-bot: Explicitly set wgServer for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548917 (owner: Jforrester)
[22:36:56] Operations, Patch-For-Review: stunnel-wrap all rsync::server usage - https://phabricator.wikimedia.org/T237424 (CDanis) There's also some added logspam to be fixed: `Nov 05 22:30:01 netmon2001 CRON[6788]: pam_unix(cron:session): session opened for user root by (uid=0) Nov 05 22:30:01 netmon2001 CRON[679...
[22:38:03] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.5 refs T233853
[22:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:08] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853
[22:39:30] !log twentyafterfour@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2076118383" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 01m 26s)
[22:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:15] (CR) Bstorm: [C: +2] toolforge: Remove old absented star.wmflabs.org certificate [puppet] - https://gerrit.wikimedia.org/r/547364 (https://phabricator.wikimedia.org/T236952) (owner: Alex Monk)
[22:44:24] (CR) Bstorm: [C: +2] "Seems legit."
[puppet] - https://gerrit.wikimedia.org/r/547364 (https://phabricator.wikimedia.org/T236952) (owner: Alex Monk)
[22:50:29] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.5 refs T233853
[22:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:34] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853
[22:51:55] !log twentyafterfour@deploy1001 scap failed: CalledProcessError Command 'cp -r "/tmp/scap_l10n_2905573311"/* "/srv/mediawiki-staging/php-1.35.0-wmf.5/cache/l10n"' returned non-zero exit status 1 (duration: 01m 26s)
[22:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:22] Well...
[23:01:30] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.5 refs T233853
[23:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:35] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853
[23:03:44] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:30] (PS5) RLazarus: httpbb: Create a new Puppet module for httpbb. [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699)
[23:16:45] (PS6) RLazarus: httpbb: Create a new Puppet module for httpbb. [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699)
[23:20:21] Operations, ops-eqiad: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (Cmjohnson) Getting this error during initial OS installation Network autoconfiguration failed │ │ Your network is probably not using the DHCP protocol....
[23:21:02] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:21:05] (03PS2) 10Bstorm: cloud: Replace SSHSessions diamond collector with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/543268 (https://phabricator.wikimedia.org/T210993) (owner: 10BryanDavis) [23:22:26] (03PS1) 10Dzahn: raise wmgMemoryLimit from 660MB to 760MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) [23:24:46] (03CR) 10Subramanya Sastry: raise wmgMemoryLimit from 660MB to 760MB (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [23:25:43] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.5 refs T233853 (duration: 24m 13s) [23:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:49] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853 [23:29:35] (03PS4) 10Ayounsi: Initial netbox interfaces support [software/homer] - 10https://gerrit.wikimedia.org/r/547562 [23:29:47] (03PS6) 10Ayounsi: Intial interfaces templates [homer/public] - 10https://gerrit.wikimedia.org/r/547584 [23:30:04] (03CR) 10Dzahn: httpbb: Create a new Puppet module for httpbb. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [23:30:14] (03PS1) 10Alex Monk: toolforge: Reload/restart services as appropriate using new acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/548927 [23:30:16] (03PS1) 1020after4: group0 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548928 [23:30:18] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548928 (owner: 1020after4) [23:31:11] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548928 (owner: 1020after4) [23:32:12] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Reload/restart services as appropriate using new acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/548927 (owner: 10Alex Monk) [23:32:50] (03CR) 10jerkins-bot: [V: 04-1] Initial netbox interfaces support [software/homer] - 10https://gerrit.wikimedia.org/r/547562 (owner: 10Ayounsi) [23:32:57] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.5 refs T233853 [23:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:01] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853 [23:34:14] (03Abandoned) 10Dzahn: cumin: add missing alias to cover labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/548539 (https://phabricator.wikimedia.org/T222061) (owner: 10Dzahn) [23:34:49] 10Operations, 10cloud-services-team: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 (10Dzahn) This is kind of a duplicate of T222061 it seems. 
[23:35:16] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Jclark-ctr) [23:35:56] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Jclark-ctr) [23:36:57] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Jclark-ctr) host racked and cabled. [23:37:37] (03CR) 10Dzahn: "> Cloud VPS are now running in Cloud VPS, so the labtestpuppetmaster >" [puppet] - 10https://gerrit.wikimedia.org/r/548539 (https://phabricator.wikimedia.org/T222061) (owner: 10Dzahn) [23:41:08] (03PS2) 10Alex Monk: toolforge: Reload/restart services as appropriate on certificate change [puppet] - 10https://gerrit.wikimedia.org/r/548927 [23:41:21] 10Operations, 10cloud-services-team: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 (10Dzahn) @Andrew Can we remove labtestpuppetmaster / turn it into a spare::system ? 
[23:42:27] (03PS2) 10Krinkle: Raise PHP memory_limit from 660MB to 760MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [23:43:45] (03PS3) 10Dzahn: remove production IPs for cobalt.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/547650 (https://phabricator.wikimedia.org/T236187) [23:44:00] (03CR) 10Krinkle: [C: 03+1] "We'll want to double check that this is acceptable across the fleet, but in general I'd recommend we do keep this applied to all app serve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [23:47:06] (03CR) 10Dzahn: [C: 03+2] remove production IPs for cobalt.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/547650 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [23:48:39] (03CR) 10Gergő Tisza: [C: 03+2] "self-merge, beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548814 (owner: 10Gergő Tisza) [23:49:12] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) a:05Dzahn→03Jclark-ctr [23:49:24] (03Merged) 10jenkins-bot: Use production RESTBase in beta GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548814 (owner: 10Gergő Tisza) [23:49:44] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) 05Stalled→03Open p:05Normal→03Low [23:49:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [23:49:52] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage gerrit1001 and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) [23:49:59] 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review: wt2html: Out of memory crashers - 
https://phabricator.wikimedia.org/T236833 (10Krinkle) To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comments at 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Krinkle) a:03Dzahn [23:53:30] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) @wiki_willy Purchase date Dec. 4, 2015 Support contract — Support expiry date Dec. 5, 2018 ^ I guess this means we'll keep it around for another year or so in the spare pool. [23:54:24] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [23:55:21] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage gerrit1001 and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) 05Open→03Resolved gerrit1001 and gerrit2001 are on buster and cobalt has been shut down and remov... [23:55:23] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [23:55:27] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Switch on http/2 in apache for gerrit - https://phabricator.wikimedia.org/T180978 (10Dzahn) [23:58:27] 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) >>! In T236833#5638686, @Krinkle wrote: > To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comme... [23:58:31] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10wiki_willy) @Dzahn - sounds good to me.