[00:00:56] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/WikiEditor/: EditAttemptStep: Allow overriding session ID (T238249) (duration: 00m 54s)
[00:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:01] T238249: Propagate editing_session_id and oversampling flag from newcomer homepage to EditAttemptStep - https://phabricator.wikimedia.org/T238249
[00:01:59] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (tstarling) I haven't made a tarball for wikidiff2 before and I can't find any documentation of how that is meant to be done. It looks...
[00:02:33] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/VisualEditor/: EditAttemptStep: Allow overriding session ID (T238249) (duration: 00m 52s)
[00:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:07] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/GrowthExperiments/: Pass token as editing_session_id for suggested edits (T238249) (duration: 00m 53s)
[00:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:11] T238249: Propagate editing_session_id and oversampling flag from newcomer homepage to EditAttemptStep - https://phabricator.wikimedia.org/T238249
[00:08:32] Operations, Commons, Multimedia, SRE-swift-storage: File not found - https://phabricator.wikimedia.org/T238695 (crusnov) p:Triage→Normal
[00:10:14] !log phab2001 - restart ssh-phab service after repooling it after buster reinstall, it wasn't listening on the IPv6 IP,causing LVS/pybal alerts
[00:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:21] RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:15:14] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (tstarling) Never mind, I found https://www.mediawiki.org/wiki/Extension:Wikidiff2/Release_process
[00:15:46] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) 19:10 < mutante> !log phab2001 - restart ssh-phab service after repooling it after buster reinstall, it wasn...
[00:16:18] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[00:17:03] RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[00:17:41] ^ these are due to the service "git-ssh" on phab which almost nobody uses and a backend has been reinstalled
[00:18:25] finally found out why that failed though.. it was listening only on v4 but not on v6 because puppet starts that before it adds the v6 IP on the interface on first run
[00:18:57] so a service restart made it listen on v6 as well and that and repooling it meant the alerts are now going away
[00:19:46] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[00:20:24] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (tstarling) Should be done now.
[00:21:34] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) - phab1001 is on buster - phab2001 is now also on buster - next: declare maintenance window to switch prod f...
[00:24:00] Operations, Analytics, Discovery, Recommendation-API: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (leila)
[00:24:09] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn) Technically this ticket is resolved. The "switch prod from phab1003 to phab1001" and "decom phab1003" are pr...
[00:26:15] Operations, MediaWiki-extensions-Translate, Traffic, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (DannyS712) I just got 502 again on commons during a normal edit (`Request from [snip] via cp4030.ulsfo.wmnet, ATS/8.0.5...
[00:26:59] Operations, Analytics, Discovery, Article-Recommendation, Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (leila)
[00:27:25] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Andrew) Thanks! Marked as active.
[00:31:41] Operations, Analytics, serviceops-radar, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (leila)
[00:34:54] (PS1) Alex Monk: deployment-prep: Migrate to new logstash host [puppet] - https://gerrit.wikimedia.org/r/551946 (https://phabricator.wikimedia.org/T238707)
[00:35:34] (CR) Alex Monk: "(labs.yaml entry went to project-wide YAML in https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/31279f370ab1b8f4cdab3" [puppet] - https://gerrit.wikimedia.org/r/551946 (https://phabricator.wikimedia.org/T238707) (owner: Alex Monk)
[00:41:45] (CR) Dzahn: [C: +1] deployment-prep: Migrate to new logstash host [puppet] - https://gerrit.wikimedia.org/r/551946 (https://phabricator.wikimedia.org/T238707) (owner: Alex Monk)
[01:58:29] (CR) Andrew Bogott: [C: +2] deployment-prep: Migrate to new logstash host [puppet] - https://gerrit.wikimedia.org/r/551946 (https://phabricator.wikimedia.org/T238707) (owner: Alex Monk)
[02:02:56] (CR) Vgutierrez: [C: -1] TLS Analytics: make parsing more robust (2 comments) [puppet] - https://gerrit.wikimedia.org/r/551840 (owner: BBlack)
[02:18:10] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet
[02:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:37] PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fa:22, 208.80.153.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[02:25:37] PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fa:22, 208.80.153.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[02:26:51] (CR) Vgutierrez: [C: +1] ATS: allow DELETE requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551849 (https://phabricator.wikimedia.org/T238540) (owner: Ema)
[02:28:14] mutante: ^^ that seems triggered by your depool
[02:30:03] Operations, OTRS, Wikimedia-Mailing-lists, Chinese-Sites: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) - https://phabricator.wikimedia.org/T173894 (Shizhao)
[02:31:09] vgutierrez: unfortunately there is a different alert when it's pooled
[02:31:33] so i just wanted to depool it again to be able to keep checking tomorrow
[02:32:54] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet
[02:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:35] RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:34:35] RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:41:03] Operations, MediaWiki-extensions-Translate, Traffic, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (Vgutierrez) I don't think so, at 00:25 GMT we had 75 requests (including yours) failing against appservers-rw.discovery....
[02:54:17] !log restarting pybal on lvs2005
[02:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:33] RECOVERY - PyBal backends health check on lvs2005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:56:47] here we go :) ^^ mutante
[02:56:53] vgutierrez: thank you :)
[02:57:00] so a glitch
[02:57:08] it drove me crazy because there were 2 unrelated issues
[02:57:28] !log restarting pybal on lvs2002
[02:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:31] RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:58:40] first the v6 backend was actually down.. but when that was fixed pybal would not realize that v4 was up and it totally worked and as you pointed out too the monitor command worked
[03:16:02] !log T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350
[03:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:07] T208369: Welcome survey: anonymize data after one year - https://phabricator.wikimedia.org/T208369
[03:24:00] Operations, SRE-tools, netbox, observability, User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (crusnov)
[03:25:03] Operations, SRE-tools, netbox, observability, User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (crusnov) Open→Resolved >>! In T224946#5675019, @faidon wrote: > What's the latest here? Please keep the task updated :) * Contact group is in place and worki...
[03:35:09] Operations, observability: Make contact group for Netbox report alerts - https://phabricator.wikimedia.org/T230725 (crusnov) Open→Resolved Thanks much Daniel, and I have completed this proccess. Closing.
[03:57:13] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul) You welcome
[04:07:00] (PS1) CRusnov: netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183)
[04:08:13] (CR) CRusnov: [C: -1] "This is normially a WIP. We need to agree on the domain name of course also, and it pre-requires that to be setup in the DNS repository be" [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[04:10:11] (CR) CRusnov: "> Patch Set 2:" (1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) (owner: CRusnov)
[04:10:11] (CR) jerkins-bot: [V: -1] netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[04:12:50] (CR) CRusnov: cables: detect duplicate cable names, and blank cable names (7 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: CRusnov)
[04:57:23] Operations, Traffic: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[05:04:59] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:08] (PS1) Vgutierrez: librenms: Reject plain text requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/551950 (https://phabricator.wikimedia.org/T238720)
[05:09:17] Operations, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Bugreporter) Note very old browsers may not support HSTS preload list or even HSTS itself; probably we want to configure a specific 403 message (...
[05:21:01] Operations, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) we are targeting here the one-off sites, some of them are already configured to support TLSv1.2 only, that's usually a stricter requi...
[05:22:00] Operations, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) Just to be clear, wikipedia and the rest of the canonical sites are out of scope for this task :)
[05:22:22] Operations, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) p:Triage→Normal
[05:27:49] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:29:57] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:31:37] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:40:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:46:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:47:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092 after schema change', diff saved to https://phabricator.wikimedia.org/P9679 and previous config saved to /var/cache/conftool/dbconfig/20191120-054840-marostegui.json
[05:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:36] (PS1) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/551952
[05:54:27] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1105:3311 db1097:3314 db1098:3316 db1098:3317 after compression', diff saved to https://phabricator.wikimedia.org/P9680 and previous config saved to /var/cache/conftool/dbconfig/20191120-055426-marostegui.json
[05:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:47] (CR) Marostegui: [C: +2] dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/551952 (owner: Marostegui)
[05:55:23] !log Depool labsdb1011 for upgrade
[05:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 and db1101:3318 for upgrade and schema change', diff saved to https://phabricator.wikimedia.org/P9681 and previous config saved to /var/cache/conftool/dbconfig/20191120-055732-marostegui.json
[05:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:19] !log Stop MySQL on db1101:3317, db1101:3318 for upgrade and schema change
[05:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:53] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:59] ^ expected
[06:06:23] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:06:27] ^ same
[06:06:31] PROBLEM - Host labsdb1011 is DOWN: PING CRITICAL - Packet loss = 100%
[06:07:23] RECOVERY - Host labsdb1011 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[06:09:45] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:47] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:15:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:16:21] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:17:45] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:17:45] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:19:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3314 for compression', diff saved to https://phabricator.wikimedia.org/P9682 and previous config saved to /var/cache/conftool/dbconfig/20191120-061938-marostegui.json
[06:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3317 after upgrade', diff saved to https://phabricator.wikimedia.org/P9683 and previous config saved to /var/cache/conftool/dbconfig/20191120-062029-marostegui.json
[06:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136 for upgrade', diff saved to https://phabricator.wikimedia.org/P9684 and previous config saved to /var/cache/conftool/dbconfig/20191120-062749-marostegui.json
[06:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:16] !log Upgrade db1136
[06:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136 after upgrade', diff saved to https://phabricator.wikimedia.org/P9685 and previous config saved to /var/cache/conftool/dbconfig/20191120-063628-marostegui.json
[06:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:57] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:04] (PS1) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/551954
[06:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1136 into s7 api', diff saved to https://phabricator.wikimedia.org/P9686 and previous config saved to /var/cache/conftool/dbconfig/20191120-064022-marostegui.json
[06:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:29] (CR) Marostegui: [C: +2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/551954 (owner: Marostegui)
[06:41:04] !log Repool labsdb1011
[06:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:35] !log Upgrade db2118 (s7 codfw master)
[06:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:59] Operations, Phabricator, Traffic, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Joe) I don't think the solution is removing aphlict, but instead proxying to it directly from envoy or ATS, our choice. @ema @Dz...
[06:54:56] Operations, serviceops, User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (Joe) Open→Resolved
[06:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1136', diff saved to https://phabricator.wikimedia.org/P9687 and previous config saved to /var/cache/conftool/dbconfig/20191120-065718-marostegui.json
[06:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:47] The BGP/OSPF alerts are due to the Telia transport between cr1 eqiad/codfw, that seems flapping.. I see maintenance scheduled and cancelled, possibly some issue is still up?
[07:04:45] Operations, serviceops: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (Joe) I'm confused. The hot restarter is the default since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536149/ We just need the cert to notify the envoy service.
[07:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1136', diff saved to https://phabricator.wikimedia.org/P9688 and previous config saved to /var/cache/conftool/dbconfig/20191120-070511-marostegui.json
[07:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:41] ah no the cancelled event was for another link
[07:06:58] the one that it is currently ongoing matches with the alerts
[07:07:01] should be good
[07:13:57] (CR) Elukey: "Hey, anything that is still blocking this? After a chat with Gehel I thought it would have been merged, just checking if you folks are wai" [puppet] - https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[07:13:59] !log Deploy schema change on s3 (testwikidatawiki) directly on s3 primary master T237120
[07:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:05] T237120: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120
[07:19:07] !log Upgrade db2078
[07:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:45] (PS1) Elukey: Add hdfs kerberos keytab to Analytics Hadoop coordinators [puppet] - https://gerrit.wikimedia.org/r/551957 (https://phabricator.wikimedia.org/T237269)
[07:19:55] !log Upgrade db2062
[07:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:10] (CR) Elukey: [C: +2] Add hdfs kerberos keytab to Analytics Hadoop coordinators [puppet] - https://gerrit.wikimedia.org/r/551957 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[07:27:55] (PS1) Giuseppe Lavagetto: tlsproxy::envoy: restart envoy when certificates change [puppet] - https://gerrit.wikimedia.org/r/551958 (https://phabricator.wikimedia.org/T238597)
[07:28:27] (PS1) Marostegui: db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/551959 (https://phabricator.wikimedia.org/T238183)
[07:29:29] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:29:38] ^ me
[07:30:38] (CR) Marostegui: [C: +2] db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/551959 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[07:31:09] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:32:37] (PS1) Elukey: profile::analytics::refinery::job::data_purge: add systemd env variables [puppet] - https://gerrit.wikimedia.org/r/551960
[07:35:43] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: add systemd env variables [puppet] - https://gerrit.wikimedia.org/r/551960 (owner: Elukey)
[07:36:05] Operations, Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez)
[07:36:49] Operations, Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez)
[07:36:54] Operations, MediaWiki-extensions-Translate, Traffic, Wikidata, User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (DannyS712) Again when trying to edit on species wiki: `Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502,...
[07:37:40] (PS1) Marostegui: mariadb: Promote db2132 to m1 codfw master [puppet] - https://gerrit.wikimedia.org/r/551961 (https://phabricator.wikimedia.org/T238183)
[07:41:28] (CR) Marostegui: [C: +2] mariadb: Promote db2132 to m1 codfw master [puppet] - https://gerrit.wikimedia.org/r/551961 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[07:42:26] (CR) Giuseppe Lavagetto: "Seems to work fine https://puppet-compiler.wmflabs.org/compiler1001/19492/phab1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/551958 (https://phabricator.wikimedia.org/T238597) (owner: Giuseppe Lavagetto)
[07:43:02] !log Promote db2132 as m1-codfw master - T238183
[07:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:08] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183
[07:48:59] (PS1) Elukey: profile::analytics::refinery::job::data_purge: fix timer commands [puppet] - https://gerrit.wikimedia.org/r/551963
[07:50:21] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: fix timer commands [puppet] - https://gerrit.wikimedia.org/r/551963 (owner: Elukey)
[07:56:49] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:02] (CR) Giuseppe Lavagetto: [C: -1] "New cronjobs should use profile::mediawiki::periodic_job. I'll amend the patch for you." [puppet] - https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: Addshore)
[08:01:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[08:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:14] (PS3) Elukey: kerberos: add syslog logging to kerberos-run-command.py [puppet] - https://gerrit.wikimedia.org/r/551794 (https://phabricator.wikimedia.org/T238306)
[08:06:46] (PS1) Marostegui: mariadb: Remove db2061 from puppet [puppet] - https://gerrit.wikimedia.org/r/551968 (https://phabricator.wikimedia.org/T238526)
[08:08:48] (CR) Marostegui: [C: +2] mariadb: Remove db2061 from puppet [puppet] - https://gerrit.wikimedia.org/r/551968 (https://phabricator.wikimedia.org/T238526) (owner: Marostegui)
[08:08:54] (PS1) Marostegui: wmnet: Remove production DNS entries for db2061 [dns] - https://gerrit.wikimedia.org/r/551969 (https://phabricator.wikimedia.org/T238526)
[08:09:53] (PS1) Muehlenhoff: Removed LDAP access for nharateh [puppet] - https://gerrit.wikimedia.org/r/551970
[08:10:17] (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries for db2061 [dns] - https://gerrit.wikimedia.org/r/551969 (https://phabricator.wikimedia.org/T238526) (owner: Marostegui)
[08:11:37] Operations, ops-codfw, decommission: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (Marostegui) a:Marostegui→Papaul
[08:12:03] Operations, ops-codfw, decommission: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (Marostegui) Host ready for @Papaul to take over
[08:12:53] (CR) Elukey: [C: +2] kerberos: add syslog logging to kerberos-run-command.py [puppet] - https://gerrit.wikimedia.org/r/551794 (https://phabricator.wikimedia.org/T238306) (owner: Elukey)
[08:14:45] (CR) Mobrovac: [C: +1] [Beta] Use Parsoid/PHP for Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: Mobrovac)
[08:16:25] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:18:30] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:19:43] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[08:22:43] (CR) Muehlenhoff: [C: +2] Removed LDAP access for nharateh [puppet] - https://gerrit.wikimedia.org/r/551970 (owner: Muehlenhoff)
[08:25:55] (PS5) Giuseppe Lavagetto: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: Addshore)
[08:26:35] (PS4) Giuseppe Lavagetto: role::debug_proxy: remove old/unused aliases, depool mwdebug1002 [puppet] - https://gerrit.wikimedia.org/r/549895
[08:29:05] (CR) jerkins-bot: [V: -1] mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: Addshore)
[08:29:31] (CR) Giuseppe Lavagetto: [C: +2] role::debug_proxy: remove old/unused aliases, depool mwdebug1002 [puppet] - https://gerrit.wikimedia.org/r/549895 (owner: Giuseppe Lavagetto)
[08:30:49] Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris) Open→Resolved a:akosiaris I 'll boldly resolv...
[08:31:30] <_joe_> moritzm: are you merging my change too?
[08:31:35] will do
[08:31:46] Operations, serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (akosiaris) Open→Resolved krypton is no more since 7a36b4e7a94f486a400f0363c263c446c33bba80, resolving.
[08:31:48] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (akosiaris)
[08:31:56] <_joe_> ack
[08:32:09] done
[08:32:14] <_joe_> thanks
[08:33:12] (PS1) Urbanecm: Initial configuration for minwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861)
[08:33:49] (CR) jerkins-bot: [V: -1] Initial configuration for minwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: Urbanecm)
[08:39:41] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:46] (CR) Giuseppe Lavagetto: kubernetes::deployment_server: Add a private/general.yaml file (2 comments) [puppet] - https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) (owner: Giuseppe Lavagetto)
[08:43:47] (PS2) Urbanecm: Initial configuration for minwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861)
[08:45:36] Operations, observability, Availability, Goal, Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (jcrespo) This is what I got so far (only per-job information so far): {P9689} This has the data of the last good, and last good full backup, and the m...
[08:48:16] (03CR) 10Giuseppe Lavagetto: prometheus: add scraping of k8s envoy sidecars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549871 (owner: 10Giuseppe Lavagetto) [08:49:04] (03PS3) 10Urbanecm: Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) [08:49:41] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: 10Urbanecm) [08:51:12] (03PS4) 10Urbanecm: Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) [08:54:32] (03PS2) 10Giuseppe Lavagetto: prometheus: add scraping of k8s envoy sidecars [puppet] - 10https://gerrit.wikimedia.org/r/549871 (https://phabricator.wikimedia.org/T237234) [08:54:34] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server: Add a private/general.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) [08:54:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094 for upgrade', diff saved to https://phabricator.wikimedia.org/P9690 and previous config saved to /var/cache/conftool/dbconfig/20191120-085448-marostegui.json [08:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:09] !log Upgrade db1094 [08:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:28] 10Operations, 10DNS, 10SRE-tools, 10Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (10fgiunchedi) [09:04:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC at https://puppet-compiler.wmflabs.org/compiler1003/329/ says this looks file (I 've looked at the errors, this commit is not to blame" [puppet] - 10https://gerrit.wikimedia.org/r/551526 
(https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [09:05:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] netbox: Expose automated DNS repository for web access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [09:06:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094 after upgrade', diff saved to https://phabricator.wikimedia.org/P9691 and previous config saved to /var/cache/conftool/dbconfig/20191120-090739-marostegui.json [09:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:51] (03PS1) 10Elukey: Add fake kerberos keytabs for Hadoop-related hosts [labs/private] - 10https://gerrit.wikimedia.org/r/552013 [09:07:53] (03PS14) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [09:08:04] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for Hadoop-related hosts [labs/private] - 10https://gerrit.wikimedia.org/r/552013 (owner: 10Elukey) [09:09:19] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202994s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [09:11:29] (03PS14) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [09:11:31] (03PS4) 10Jcrespo: bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [09:11:33] (03PS1) 10Jcrespo: bacula-check: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/552014 
[09:11:35] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10mobrovac) 05Resolved→03Open This doesn't seem to be working as expected. On the client, I always get `Server: envoy`: ` $ curl https://test.wikipedia.org/api/rest_v1/page/html/Testparso... [09:11:37] (03PS1) 10Elukey: Add kerberos fake keytabs for analytics hadoop coordinators [labs/private] - 10https://gerrit.wikimedia.org/r/552015 [09:12:00] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] bacula-check: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/552014 (owner: 10Jcrespo) [09:12:10] (03PS2) 10Jcrespo: bacula-check: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/552014 [09:12:25] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add kerberos fake keytabs for analytics hadoop coordinators [labs/private] - 10https://gerrit.wikimedia.org/r/552015 (owner: 10Elukey) [09:13:28] (03PS1) 10Urbanecm: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) [09:13:41] (03CR) 10jerkins-bot: [V: 04-1] bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:14:55] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] bacula-check: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/552014 (owner: 10Jcrespo) [09:15:55] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [09:16:48] (03PS4) 10Urbanecm: Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [09:18:49] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - 
https://phabricator.wikimedia.org/T237319 (10Vgutierrez) From ATS source code, it looks like ATS logs connect errors on the first attempt but it's configured to perf... [09:21:02] (03PS15) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [09:21:04] (03PS5) 10Jcrespo: bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [09:23:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1094 after upgrade', diff saved to https://phabricator.wikimedia.org/P9692 and previous config saved to /var/cache/conftool/dbconfig/20191120-092337-marostegui.json [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:11] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [09:27:17] (03CR) 10Ema: [C: 03+2] ATS: allow DELETE requests [puppet] - 10https://gerrit.wikimedia.org/r/551849 (https://phabricator.wikimedia.org/T238540) (owner: 10Ema) [09:33:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1094 after upgrade', diff saved to https://phabricator.wikimedia.org/P9693 and previous config saved to /var/cache/conftool/dbconfig/20191120-093308-marostegui.json [09:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:53] (03PS1) 10Urbanecm: Initial configuration for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) [09:41:09] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) Ok, a bit of digging: `lang=bash # From the public internet $ for dc in eqiad codfw esams eqsin ulsfo; do echo -n "$dc: "; curl --resolve test.wikipedia.org:443:$(dig +short text-lb.$... 
[09:47:14] !log Compress dbstore1004:3314 [09:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:18] !log Compress dbstore1005:3318 [09:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:07] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) Not very mysteriously, the edges use ATS-BE so they call envoy, while the main dcs are still contacting restbase directly. Meh. [09:52:21] !log mobrovac@deploy1001 Started deploy [restbase/deploy@c677063]: Switch test2.wp back to Parsoid/JS temporarily - T238716 [09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:26] T238716: VisualEditor does not see previously saved content on testwiki - https://phabricator.wikimedia.org/T238716 [09:55:13] (03PS1) 10Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) [09:55:52] (03CR) 10jerkins-bot: [V: 04-1] Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [09:56:42] !log Compress db2106 [09:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:54] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) And indeed it seems things are not working as expected: ` restbase2015:~$ curl restbase2015:7231/de.wikipedia.org/v1/page/references/Der_Junge_mit_dem_gro%C3%9Fen_schwarzen_Hund -Is |... 
[09:57:21] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) I find this pretty worrisome for the following reasons: # right now we have one remap rule that catches all... [09:57:49] 10Operations, 10Packaging, 10serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (10Joe) In the meantime, we have a security release 1.12.1 - I will build it and upload it to stretch and buster. [09:58:37] 10Operations, 10Packaging, 10serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (10Joe) [09:58:39] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) [09:59:06] (03PS2) 10Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) [09:59:50] (03CR) 10jerkins-bot: [V: 04-1] Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [10:03:01] (03CR) 10Arturo Borrero Gonzalez: cloudservices: move from pdns3 to pdns4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [10:05:16] (03PS3) 10Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) [10:07:15] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@c677063]: Switch test2.wp back to Parsoid/JS temporarily - T238716 (duration: 14m 54s) [10:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:21] T238716: VisualEditor does not see previously 
saved content on testwiki - https://phabricator.wikimedia.org/T238716 [10:08:18] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (10Joe) >>! In T237319#5677665, @Vgutierrez wrote: > I find this pretty worrisome for the following reasons: > # right now... [10:08:56] (03PS2) 10Gehel: Repoint mjolnir daemons at deploy directory [puppet] - 10https://gerrit.wikimedia.org/r/540960 (owner: 10EBernhardson) [10:11:06] (03PS16) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [10:11:08] (03PS6) 10Jcrespo: bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [10:11:10] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [10:11:21] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) nope, a 5xx doesn't translate to `BAD_INCOMING_RESPONSE`, actually is specifically whitelisted: `lang=C++ ca... 
[10:12:51] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) [10:12:59] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) [10:13:15] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) [10:13:40] (03CR) 10Gehel: [C: 03+2] Repoint mjolnir daemons at deploy directory [puppet] - 10https://gerrit.wikimedia.org/r/540960 (owner: 10EBernhardson) [10:14:23] !log Compress db2095:3314 [10:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:45] (03CR) 10jerkins-bot: [V: 04-1] prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [10:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P9694 and previous config saved to /var/cache/conftool/dbconfig/20191120-101727-marostegui.json [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Joe) >>! In T237319#5677739, @Vgutierrez wrote: > nope, a 5xx doesn't translate to `BAD_INCOMING_RESPONSE`, actually is specifically whitelisted: > `lang=C++ > case STATUS_CODE_SERVER_ERROR: >... 
[10:20:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [10:21:24] (03CR) 10Ema: [C: 03+1] tlsproxy::envoy: restart envoy when certificates change [puppet] - 10https://gerrit.wikimedia.org/r/551958 (https://phabricator.wikimedia.org/T238597) (owner: 10Giuseppe Lavagetto) [10:21:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tlsproxy::envoy: restart envoy when certificates change [puppet] - 10https://gerrit.wikimedia.org/r/551958 (https://phabricator.wikimedia.org/T238597) (owner: 10Giuseppe Lavagetto) [10:22:30] _joe_: https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/26388/console fails the style check :/ [10:22:34] * addshore isnt sure why exactly [10:22:53] <_joe_> addshore: I know, damn me :P [10:22:54] !log mobrovac@deploy1001 Started deploy [restbase/deploy@daa7808]: Revert switching test2.wp to Parsoid/JS - T238716 [10:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:59] T238716: VisualEditor does not see previously saved content on testwiki - https://phabricator.wikimedia.org/T238716 [10:23:02] :D [10:23:21] <_joe_> addshore: I'll try to merge that today [10:23:39] that would be great! 
:) [10:25:18] <_joe_> my team needs to convert all crons to profile::mediawiki::periodic_job btw [10:25:42] (03PS1) 10Ema: cache: reimage cp2010 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552031 (https://phabricator.wikimedia.org/T227432) [10:25:43] <_joe_> it's at least something vaguely resembling a software solution [10:27:07] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [10:30:08] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:30:27] !log Upgrade db1116 [10:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:22] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:56] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [10:34:12] !log depool cp2010 and reimage as text_ats T227432 [10:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:17] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:34:37] (03CR) 10Ema: [C: 03+2] cache: reimage cp2010 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552031 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:35:53] (03PS3) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [10:36:50] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@daa7808]: Revert switching 
test2.wp to Parsoid/JS - T238716 (duration: 13m 56s) [10:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:55] T238716: VisualEditor does not see previously saved content on testwiki - https://phabricator.wikimedia.org/T238716 [10:36:55] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2010.codfw.wmnet'] ` The log can be found in `/var/log/wm... [10:37:25] (03CR) 10Volans: [C: 03+1] "LGTM, let's test it" [puppet] - 10https://gerrit.wikimedia.org/r/551950 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:38:33] \o/ let's break stuff [10:38:42] (03CR) 10Vgutierrez: [C: 03+2] librenms: Reject plain text requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/551950 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:39:42] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (208345s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [10:40:22] that's a interesting timing, but it wasn't me [10:40:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack Neutron: Remove no-longer-supported min_l3_agents_per_router [puppet] - 10https://gerrit.wikimedia.org/r/550504 (owner: 10Andrew Bogott) [10:42:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "the WMCS part LGTM. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/551534 (owner: 10Jbond) [10:43:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/550503 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [10:48:01] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) > One real problem is: some of our requests are incredibly long and might overflow the timeouts in ATS-BE - in that case, pybal won't depool a backend but we might still get an error... [10:49:28] vgutierrez: o/ - qq: why don't just remove the vhost? Also I am wondering if that works for files too (usually I have always seen a doc root set + directory with require all denied) [10:50:15] elukey: so if I currently curl "http://librenms.wikimedia.org/graph.php?height=273&width=586&to=1574233500&id=8308&type=port_errors&from=1573628700" I get a 403 [10:50:22] and before applying the change I got a 301 to https [10:50:40] elukey: it's just a previous step.. to see if we break anything [10:50:50] ahhh okok [10:50:57] elukey: usually an immediate 403 is better and faster than a socket timeout [10:51:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I acknowledge we don't have a better workflow right now than copying the directory trees. 
But the truth is that most of the code (scripts," [puppet] - 10https://gerrit.wikimedia.org/r/550502 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [10:51:42] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1083:9536,cp1085:9536,cp1089:9536} site=eqiad tunnel={cp2010_v4,cp2010_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:52:05] that's ema :) [10:52:15] vgutierrez: agreed, just asked since the config looked strange, that's it [10:52:21] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [10:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:43] there is also no docroot and the default (/usr/local/apache/htdocs) is not there, so ok too [10:53:03] anyway, I was just curious, thanks :) [10:54:32] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:57] 10Operations, 10Traffic, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) @Joe on the other hand ATS doesn't reap connections for hosts marked as down, and because ats-be uses KA it should have plenty of available connections against appserver-rw.discovery.... [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] I'll do my own SWAT. [11:00:27] Heads-up: Full scap will be required. 
[11:01:53] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [11:02:04] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (10Mathew.onipe) [11:04:10] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2010.codfw.wmnet'] ` and were **ALL** successful. [11:04:17] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:05:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 33, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:05:53] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:05:55] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:06:03] * jynus when Urbanecm has to do its own scap: https://youtu.be/Bu_hw963AG4?t=12 [11:06:09] *his [11:06:40] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (10Gehel) @Mathew.onipe needs to be a member of... 
[11:07:44] (03PS4) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [11:10:26] (03CR) 10Volans: [C: 04-1] "Missing file?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [11:10:56] LOL [11:15:06] (03PS3) 10Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [11:15:32] !log pool cp2010 with ATS backend T227432 [11:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:17:55] (03PS7) 10Jcrespo: bacula: Add prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [11:18:11] (03PS4) 10Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [11:18:22] (03PS5) 10Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [11:19:49] (03PS1) 10Ema: cache: reimage cp2012 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552047 (https://phabricator.wikimedia.org/T227432) [11:21:04] !log depool cp2012 and reimage as text_ats T227432 [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:09] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:21:18] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/19498/backup1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: 
10Jcrespo) [11:23:43] (03CR) 10Ema: [C: 03+2] cache: reimage cp2012 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552047 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:24:39] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2012.codfw.wmnet'] ` The log can be found in `/var/log/wm... [11:24:52] (03CR) 10Jbond: [C: 03+2] admin: add ezachte ssh key [puppet] - 10https://gerrit.wikimedia.org/r/551836 (https://phabricator.wikimedia.org/T215790) (owner: 10Jbond) [11:25:03] (03PS2) 10Jbond: admin: add ezachte ssh key [puppet] - 10https://gerrit.wikimedia.org/r/551836 (https://phabricator.wikimedia.org/T215790) [11:25:05] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1001/19499/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [11:25:58] PROBLEM - traffic_server backend process restarted on cp2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2010&var-layer=backend [11:27:04] (03PS2) 10ArielGlenn: add new partman recipe that skips format of /data partition for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/551503 (https://phabricator.wikimedia.org/T224563) [11:27:46] !log cp2010: ats-backend-restart to clear backend restart alert [11:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:05] 10Operations, 10SRE-tools, 10netbox: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (10faidon) This is an excerpt of the backlog overnight: ` 01:17 < icinga-wm> PROBLEM - Netbox report coherence. 
on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedi... [11:29:50] RECOVERY - traffic_server backend process restarted on cp2010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2010&var-layer=backend [11:30:00] !log urbanecm@deploy1001 Started scap: SWAT: 44ec4e4: e1baf0e: 3c02aa7: Namespace changes [11:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:58] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2012_v4,cp2012_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:31:03] (03CR) 10Jcrespo: "So I was thinking on merging this early, when metrics are not yet definitive (even if there are 18 metrics per job * 95 configured jobs) a" [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [11:31:36] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2012_v4,cp2012_v6} Ema reimaging 2012 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:32:58] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10Arjunaraoc) @MSantos thanks for your update. I am happy to know that you are working on this as a main priority. 
[11:34:20] (03CR) 10Urbanecm: [C: 03+2] Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: 10Urbanecm) [11:34:29] (03PS5) 10Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) [11:34:35] (03CR) 10Urbanecm: [C: 03+2] Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: 10Urbanecm) [11:35:23] (03Merged) 10jenkins-bot: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: 10Urbanecm) [11:36:15] !log urbanecm@deploy1001 Finished scap: SWAT: 44ec4e4: e1baf0e: 3c02aa7: Namespace changes (duration: 06m 15s) [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: f847380: Set namespace alias for Index: (NS 102/103) for elwikisource (T237253) (duration: 00m 54s) [11:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:41] T237253: Rename el.wikisource namespace "Βιβλίο" to "Μεταγραφή" - https://phabricator.wikimedia.org/T237253 [11:40:04] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [11:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:14] (03Abandoned) 10ArielGlenn: add partman recipe that leaves /data on dump servers alone [puppet] - 10https://gerrit.wikimedia.org/r/551879 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [11:41:54] (03PS3) 10ArielGlenn: add new partman recipe that skips format of /data partition for dumps servers [puppet] - 
10https://gerrit.wikimedia.org/r/551503 (https://phabricator.wikimedia.org/T224563) [11:42:10] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:50] (03PS6) 10Urbanecm: Partial cleanup of InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [11:43:56] (03CR) 10Urbanecm: [C: 03+2] "noop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [11:44:44] (03Merged) 10jenkins-bot: Partial cleanup of InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [11:46:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 51ecd71: Partial cleanup of InitializeSettings (T231178) (duration: 00m 52s) [11:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:20] T231178: General cleanup of initialize settings - https://phabricator.wikimedia.org/T231178 [11:47:05] (03CR) 10ArielGlenn: [C: 03+2] add new partman recipe that skips format of /data partition for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/551503 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [11:50:40] (03PS2) 10Urbanecm: [rowiki] Enable 'deleterevision' for patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539553 (https://phabricator.wikimedia.org/T234051) (owner: 10Strainu) [11:51:00] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) Renaming MIDDLEWARE_CLASSES to MIDDLEWARE seems to have fixed this problem. 
[11:51:12] 10Operations, 10Traffic, 10Wikidata, 10observability, 10User-Addshore: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (10Addshore) 05Open→03Resolved a:03Addshore [11:51:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT: Legal said it is fine if all accidentals undeletions are auto-reported to admins, who can revoke the right." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539553 (https://phabricator.wikimedia.org/T234051) (owner: 10Strainu) [11:51:59] (03Merged) 10jenkins-bot: [rowiki] Enable 'deleterevision' for patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539553 (https://phabricator.wikimedia.org/T234051) (owner: 10Strainu) [11:53:01] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2012.codfw.wmnet'] ` and were **ALL** successful. [11:55:34] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:55:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 2b13fbe: [rowiki] Enable deleterevision for patrollers (T234051) (duration: 00m 52s) [11:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:46] T234051: Patroller right changes to ro.wp - https://phabricator.wikimedia.org/T234051 [11:55:48] !log EU SWAT done [11:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:59] !log pool cp2012 with ATS backend T227432 [11:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:04] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:02:42] (03PS1) 10Ema: cache: reimage cp2013 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552058 (https://phabricator.wikimedia.org/T227432) [12:06:27] (03PS1) 10Arturo Borrero Gonzalez: openstack: bootstrap ocata puppet code for servers [puppet] - 10https://gerrit.wikimedia.org/r/552059 (https://phabricator.wikimedia.org/T237749) [12:08:37] !log depool cp2013 and reimage as text_ats T227432 [12:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:48] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:08:54] (03CR) 10Ema: [C: 03+2] cache: reimage cp2013 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552058 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [12:10:08] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2013.codfw.wmnet'] ` The log can be found in `/var/log/wm... 
[12:12:08] Urbanecm, Amir1, will new wikis be created today? [12:12:19] yup [12:12:19] Jhs: we hope so :) [12:12:25] sweet [12:12:47] all four (or five, with Georgian User Group), or "just" one or two? [12:13:23] the plan is for all four [12:13:32] Amir1, <3 [12:13:47] Amir1: we have five wikis pending, one is non-content [12:14:13] i thought one of them is not ready yet [12:14:18] but I might be wrong [12:14:34] Amir1: I think all of them are ready, I'll check one more through [12:14:53] Thanks [12:17:46] Urbanecm, you saw my comment on WikimediaMessages btw? (it's not really an urgent issue though) [12:18:02] Jhs: yes, I've uploaded a follow-up patch [12:18:08] Urbanecm, awesome [12:18:56] (03CR) 10Jon Harald Søby: [C: 04-1] "The project name should be in InitialiseSettings.php as well. Wikasegzawal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) (owner: 10Urbanecm) [12:20:28] (03PS2) 10Jon Harald Søby: Initial configuration for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) (owner: 10Urbanecm) [12:21:48] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2013_v4,cp2013_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:22:10] (03PS2) 10Jon Harald Søby: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [12:22:11] 10Operations, 10Gerrit-Privilege-Requests, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (10MarcoAurelio) [12:23:11] thx Jhs 
[12:23:21] np Urbanecm [12:24:09] the WikimediaMessages patch is fixed btw [12:24:14] (hi) [12:24:17] thx hauskater [12:24:21] (03PS5) 10Jon Harald Søby: Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: 10Urbanecm) [12:25:25] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [12:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:33] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:20] (03CR) 10Volans: [C: 04-1] "Some minor nits, some bikeshedding and an error, see inline for details." (035 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [12:33:48] (03PS2) 10BBlack: TLS Analytics: make parsing more robust [puppet] - 10https://gerrit.wikimedia.org/r/551840 [12:34:43] (03CR) 10Effie Mouzeli: "> LGTM, are there equivalent metrics for php-fpm (or needed?)" [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:36:19] (03CR) 10Effie Mouzeli: [C: 03+2] prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:36:49] (03PS3) 10Effie Mouzeli: prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) [12:38:15] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2013.codfw.wmnet'] ` and were **ALL** successful. 
[12:41:15] (03PS2) 10Effie Mouzeli: logstash: remove HHVM references [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) [12:43:18] !log pool cp2013 with ATS backend T227432 [12:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:24] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:54:12] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/19507/" [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [12:54:23] (03PS2) 10DCausse: Revert "Revert "Add L and M to allowed statement starts"" [puppet] - 10https://gerrit.wikimedia.org/r/547541 (https://phabricator.wikimedia.org/T222321) [12:54:25] (03PS3) 10DCausse: Support /entity/ and other Wikidata URLs for Commons [puppet] - 10https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: 10Smalyshev) [12:54:27] (03PS1) 10DCausse: [wikidata] provide better link to statement information [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) [12:55:54] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:56:44] (03CR) 10Effie Mouzeli: [C: 03+2] logstash: remove HHVM references [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:59:24] (03PS2) 10DCausse: [wikidata] provide better link to statement information [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) [13:01:26] jouncebot: next [13:01:26] In 4 hour(s) and 58 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T1800) [13:01:32] jouncebot: now [13:01:32] No deployments scheduled for the next 4 hour(s) and 58 minute(s) [13:02:23] lunch, bbl [13:02:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] [wikidata] provide better link to statement information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: 10DCausse) [13:06:48] (03CR) 10Muehlenhoff: Add image submission mode to debmonitor client (034 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [13:07:08] (03PS1) 10Filippo Giunchedi: puppetmaster: update progress during facts export [puppet] - 10https://gerrit.wikimedia.org/r/552062 [13:08:55] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [13:19:52] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) 05Resolved→03Open [13:19:55] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [13:24:21] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to 
aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) Either way when switching the Hiera key it should either enable or disable all the things and not just some of them. A... [13:25:25] (03CR) 10Filippo Giunchedi: "Only glanced at the exporter review in Ic0b9c3330 but see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:32:21] !log updated puppet compiler facts on compiler100* hosts [13:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:48] (03PS1) 10Elukey: superset: add ProxyPreserveHost in httpd to avoid broken report generation [puppet] - 10https://gerrit.wikimedia.org/r/552065 (https://phabricator.wikimedia.org/T238461) [13:35:21] (03CR) 10Elukey: [C: 03+2] superset: add ProxyPreserveHost in httpd to avoid broken report generation [puppet] - 10https://gerrit.wikimedia.org/r/552065 (https://phabricator.wikimedia.org/T238461) (owner: 10Elukey) [13:39:12] _joe_: any joy with that puppet change? :) [13:39:38] <_joe_> addshore: I'll get to it in a few, I need to start repackaging envoy first [13:40:01] 10Operations, 10MediaWiki-Maintenance-scripts, 10serviceops: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10Dzahn) @DannyS712 I don't think we should even add "Patch-For-Review" in the first place, i am not aware of a single time getting a review be...
[13:40:30] ack [13:45:28] (03PS1) 10Marostegui: db2062: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552066 (https://phabricator.wikimedia.org/T238726) [13:47:33] (03CR) 10Marostegui: [C: 03+2] db2062: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552066 (https://phabricator.wikimedia.org/T238726) (owner: 10Marostegui) [13:50:49] 10Operations, 10MediaWiki-Maintenance-scripts, 10serviceops: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10MarcoAurelio) >>! In T230110#5678220, @Dzahn wrote: > @DannyS712 I don't think we should even add "Patch-For-Review" in the first place, i am... [13:52:43] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Gilles) I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by... [13:53:33] (03CR) 10CDanis: [C: 03+1] puppetmaster: update progress during facts export [puppet] - 10https://gerrit.wikimedia.org/r/552062 (owner: 10Filippo Giunchedi) [13:55:00] 10Operations, 10Commons, 10SRE-swift-storage: File not found - https://phabricator.wikimedia.org/T238695 (10Aklapper) [13:55:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) followed up with dell regarding results spoke with Madhusudan.Rao@dell.com on phone. he will follow up with K... 
[13:56:18] (03PS15) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [13:56:29] (03PS3) 10Gehel: Revert "Revert "Add L and M to allowed statement starts"" [puppet] - 10https://gerrit.wikimedia.org/r/547541 (https://phabricator.wikimedia.org/T222321) (owner: 10DCausse) [13:57:01] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/552062 (owner: 10Filippo Giunchedi) [13:57:23] (03PS16) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [13:57:41] (03PS4) 10Gehel: Revert "Revert "Add L and M to allowed statement starts"" [puppet] - 10https://gerrit.wikimedia.org/r/547541 (https://phabricator.wikimedia.org/T222321) (owner: 10DCausse) [14:00:17] 10Operations, 10Commons, 10SRE-swift-storage: File on Commons not found: File:Nl-gegourmet.ogg - https://phabricator.wikimedia.org/T238695 (10Aklapper) [14:05:16] (03PS1) 10Ema: cache: reimage cp2016 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552068 (https://phabricator.wikimedia.org/T227432) [14:06:57] !log depool cp2016 and reimage as text_ats T227432 [14:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:03] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:07:36] (03CR) 10Ema: [C: 03+2] cache: reimage cp2016 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552068 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:08:11] godog: OK to puppet-merge your change too? [14:08:21] ema: whoops, yes thank you [14:08:46] done! 
[14:08:55] (03CR) 10Gehel: [C: 03+2] Revert "Revert "Add L and M to allowed statement starts"" [puppet] - 10https://gerrit.wikimedia.org/r/547541 (https://phabricator.wikimedia.org/T222321) (owner: 10DCausse) [14:09:31] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2016.codfw.wmnet'] ` The log can be found in `/var/log/wm... [14:09:40] (03CR) 10MarcoAurelio: Restrict editing CNBanner namespace to autoconfirmed on metawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: 10Ammarpad) [14:11:26] (03PS6) 10Giuseppe Lavagetto: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [14:11:53] (03CR) 10Volans: "Replied inline" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [14:12:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] kubernetes::deployment_server: Add a private/general.yaml file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [14:15:20] (03PS7) 10Giuseppe Lavagetto: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [14:15:33] * addshore watches [14:16:34] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2016_v4,cp2016_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status 
[14:18:55] (03CR) 10Ema: Public cache routing for eventgate-logging-external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:19:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19509/ seems to do what we want." [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [14:20:07] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2016_v4,cp2016_v6} Ema reimaging 2016 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:20:26] (03CR) 10Muehlenhoff: Add image submission mode to debmonitor client (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [14:20:37] <_joe_> addshore: what if I removed that redirection to file [14:20:51] <_joe_> and let systemd take care of creating the logs we want? [14:21:01] _joe_: sounds good to me [14:21:09] (03CR) 10Herron: [C: 03+2] logstash: parse DPT and SPT from ulogd events [puppet] - 10https://gerrit.wikimedia.org/r/551270 (https://phabricator.wikimedia.org/T238416) (owner: 10Herron) [14:21:19] <_joe_> addshore: also, is this a "prometheus pusher"? [14:21:31] <_joe_> I'm merging it but we usually go the other way around [14:21:44] hmm, no, only queries prometheus, so i guess not a "pusher" [14:21:44] <_joe_> we create an exporter that runs locally and let prometheus fetch metrics [14:21:50] <_joe_> ohhh I see [14:21:58] <_joe_> and why are you querying prometheus [14:22:00] <_joe_> ? [14:22:13] <_joe_> to update some value used by MediaWiki? [14:22:14] getting the lag status of all wdqs servers [14:22:27] <_joe_> if prometheus is unresponsive or down?
[14:22:41] <_joe_> or if the wdqs servers are broken and don't report the lag? [14:22:54] (03CR) 10Ottomata: Public cache routing for eventgate-logging-external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [14:22:55] if no response, the value doesn't get written, and will not be taken into account for maxlag (which is fine) [14:22:59] <_joe_> sorry, usual SRE "what if something breaks" questions [14:23:02] <_joe_> ok perfect [14:23:04] hey _joe_ looks like we both have changes staged up for puppet-merge, if you want I’ll leave it to you to enter multiple when ready [14:23:15] mine is just a logstash filter update, pretty minor thing [14:23:18] <_joe_> herron: no please go on [14:23:47] ok I’ll puppet-merge yours as well [14:23:51] <_joe_> thanks [14:24:07] <_joe_> I was just discussing details with addshore that were beyond the patch itself [14:24:08] (03CR) 10Ottomata: "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/552065 (https://phabricator.wikimedia.org/T238461) (owner: 10Elukey) [14:24:10] please give me a ping when it is running :) [14:24:50] ok puppet-merged, np [14:24:56] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [14:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:59] <_joe_> Nov 20 14:26:01 mwmaint2001 mediawiki_job_wikidata-updateQueryServiceLag[19664]: Skipping execution, not the master datacenter! [14:27:04] <_joe_> ok, perfect! [14:27:04] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:14] :D [14:27:23] <_joe_> addshore: that's what we want, that's codfw [14:27:30] yupp :) [14:32:05] <_joe_> addshore: it's working AFAICT [14:32:14] amazing!
[14:32:19] <_joe_> but lemme modify the logging [14:32:24] thanks for the merge and changes [14:32:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Haven't had time to study the code for now, but overall +1 on the premise. And yes let's ship it early and iterate on it." [puppet] - 10https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:33:35] (03CR) 10DCausse: [wikidata] provide better link to statement information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: 10DCausse) [14:34:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor error, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:36:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] [wikidata] provide better link to statement information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: 10DCausse) [14:36:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: add scraping of k8s envoy sidecars [puppet] - 10https://gerrit.wikimedia.org/r/549871 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [14:36:49] (03PS3) 10Effie Mouzeli: hhvm: Remove hhvm module from puppet [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) [14:37:48] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2016.codfw.wmnet'] ` and were **ALL** successful.
[14:38:56] (03PS1) 10Addshore: wgWikidataOrgQueryServiceMaxLagFactor 180 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552069 (https://phabricator.wikimedia.org/T221774) [14:38:59] jouncebot now [14:39:00] No deployments scheduled for the next 3 hour(s) and 20 minute(s) [14:39:31] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:39:45] (03CR) 10Andrew Bogott: cloudservices: move from pdns3 to pdns4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [14:39:58] (03CR) 10Addshore: [C: 03+2] wgWikidataOrgQueryServiceMaxLagFactor 180 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552069 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [14:40:42] (03Merged) 10jenkins-bot: wgWikidataOrgQueryServiceMaxLagFactor 180 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552069 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [14:40:45] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::maintenance::wikidata: use systemd logging [puppet] - 10https://gerrit.wikimedia.org/r/552070 [14:40:50] <_joe_> addshore: ^^ [14:40:56] * addshore looks [14:41:12] 10Operations, 10serviceops: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10CDanis) 05Open→03Resolved a:03Joe [14:41:19] (03CR) 10Addshore: [C: 03+1] profile::mediawiki::maintenance::wikidata: use systemd logging [puppet] - 10https://gerrit.wikimedia.org/r/552070 (owner: 10Giuseppe Lavagetto) [14:41:24] looks great [14:43:29] (03PS21) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [14:43:32] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [14:43:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::maintenance::wikidata: use systemd logging [puppet] - 10https://gerrit.wikimedia.org/r/552070 (owner: 10Giuseppe Lavagetto) [14:46:58] (03PS3) 10Andrew Bogott: cloudservices: move from pdns3 to pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) [14:47:49] !log disable puppet on all mw* servers [14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:50] !log pool cp2016 with ATS backend T227432 [14:49:53] (03CR) 10jerkins-bot: [V: 04-1] cloudservices: move from pdns3 to pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) (owner: 10Andrew Bogott) [14:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:55] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:50:53] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (use altered lag, not raw lag) [[gerrit:552072]] (duration: 00m 53s) [14:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:58] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [14:52:12] (03PS4) 10Andrew Bogott: cloudservices: move from pdns3 to pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) [14:52:39] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s) [14:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:28] * addshore is done [14:59:05] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: Remove hhvm module 
from puppet [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:59:44] 10Operations, 10Commons, 10SRE-swift-storage: File on Commons not found: File:Nl-gegourmet.ogg - https://phabricator.wikimedia.org/T238695 (10CDanis) I grepped through both the `swiftrepl` logs on `ms-fe1005` and also the aggregated Swift mutation-operation logs on `centrallog1001` and found no mention of th... [15:00:25] (03PS2) 10Gehel: wdqs: move wdqs1007 from internal to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/551189 (https://phabricator.wikimedia.org/T238229) [15:00:32] (03CR) 10Herron: "PCC looks good for existing logstash 5 hosts https://puppet-compiler.wmflabs.org/compiler1001/19514/" [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:01:09] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) This is the data for cp3064 over that period. {F31111963, size=full} The 2 dots represent the 2 events you've mentioned. Ignore the fact that... [15:02:00] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10TechCom, 10Release-Engineering-Team (Development services): Expand Gerrit Manager permissions - https://phabricator.wikimedia.org/T234474 (10Dzahn) [15:02:28] (03CR) 10Herron: [C: 03+2] logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:02:49] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Do you expect that there's a delay for the SSL cert change to affect people? If that's the case then we can certainly see a regression ramping... 
[15:03:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Aside from the fact that TLS on 43192 doesn't seem to be exposed yet, +1 from me." [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:03:28] (03CR) 10Gehel: [C: 03+2] wdqs: move wdqs1007 from internal to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/551189 (https://phabricator.wikimedia.org/T238229) (owner: 10Gehel) [15:03:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:04:33] (03CR) 10Jforrester: "Puppet reference removed in I1fcab6914d0b8." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [15:05:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add LVS entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550914 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:05:11] (03PS2) 10Alexandros Kosiaris: Add LVS entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550914 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:05:50] (03CR) 10Ottomata: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:05:50] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 4 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Dzahn) modules still using base::service_unit: - confd - varnishkafka - mediawiki (cgroups) - udp2log - uwsgi - redis - service - (base) (base::servic... 
[15:05:58] !log Enable puppet on mw* [15:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:54] (PS1) Ema: ATS: explicitly skip the cache instead of hiding CC [puppet] - https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) [15:08:17] !log reset LVS weight for wdqs public eqiad to 10 [15:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:26] Operations, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) [15:08:31] (CR) Alexandros Kosiaris: [C: +1] Enable TLS envoyproxy for eventgate-logging-external instances [deployment-charts] - https://gerrit.wikimedia.org/r/551263 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:10:57] (CR) Jcrespo: "I would wish all my reviews were just a case issue! :-)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [15:18:46] (PS1) Ema: cache: reimage cp2019 as text_ats [puppet] - https://gerrit.wikimedia.org/r/552077 (https://phabricator.wikimedia.org/T227432) [15:19:19] (PS6) Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [15:19:28] !log depool cp2019 and reimage as text_ats T227432 [15:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:33] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:20:01] (CR) Jcrespo: "Let me know what you think." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [15:21:32] (CR) Jcrespo: "Bad wording."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [15:21:38] (CR) Ema: [C: +2] cache: reimage cp2019 as text_ats [puppet] - https://gerrit.wikimedia.org/r/552077 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [15:23:19] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2019.codfw.wmnet'] ` The log can be found in `/var/log/wm... [15:27:56] o/ Amir1 [15:28:08] o/ [15:28:16] Amir1: mw1223 seems to be doing the right thing? [15:28:23] https://usercontent.irccloud-cdn.com/file/Emo1hdHm/image.png [15:28:29] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2019_v4,cp2019_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:28:34] mw1232 is not doing the right thing [15:29:01] Amir1: shall I just sync IS.php again? [15:29:15] (PS5) Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [15:29:34] let me login to mw1232 double check [15:29:39] ack [15:29:39] what do you think?
[15:29:52] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2019_v4,cp2019_v6} Ema reimaging cp2019 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:30:09] Operations, ops-eqiad, serviceops: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (jijiki) [15:30:53] Operations, netops, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (fgiunchedi) Thanks! I think we should go with (2) (i.e. investigate integration between icinga (or grafana alerts, and from there icinga checks) for fas... [15:33:02] Amir1: any joy? [15:33:07] addshore: yup, in mw1232, IS.php says 180 for the lag [15:33:46] https://usercontent.irccloud-cdn.com/file/X535u9iV/image.png [15:33:52] I seem to remember something like this happening in the past, and touching the files was needed or something? [15:34:16] I don't know, I think it's not sync'ed at all [15:34:29] I'll resync now [15:34:45] maybe some nodes are not in the scap list and they are serving traffic, I hope not [15:36:08] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s) [15:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:13] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [15:36:24] Amir1: that seemed to do it.... [15:36:35] Now I'm seeing the same thing (correct thing) from all hosts [15:36:41] how odd..
[15:36:57] yeah, meh :D [15:38:46] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [15:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:20] Amir1, addshore: can i interject with a quick sync that is for beta only, so no op for prod? [15:39:29] mobrovac: yup, im all done! [15:39:36] (PS6) Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [15:39:38] gr8 thnx [15:39:46] (CR) Mobrovac: [C: +2] [Beta] Use Parsoid/PHP for Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: Mobrovac) [15:40:43] (Merged) jenkins-bot: [Beta] Use Parsoid/PHP for Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: Mobrovac) [15:40:51] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:14] (CR) Jcrespo: "Case + package dependency" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [15:42:49] !log mobrovac@deploy1001 Synchronized wmf-config/LabsServices.php: [BETA-ONLY] Switch Flow to use Parsoid/PHP - T229078 (duration: 00m 52s) [15:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:55] T229078: Preparing Flow for Parsoid-PHP switch - https://phabricator.wikimedia.org/T229078 [15:44:07] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:47:10] Operations, ops-codfw, observability, User-fgiunchedi: Update label and switch port for wezen -> centrallog2001 - https://phabricator.wikimedia.org/T238642 (Papaul) Open→Resolved complete [15:47:12] Operations, observability, User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (Papaul) [15:50:48] Operations, ops-codfw, decommission: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/9; [edit interfaces interface-range disabled] member ge-6/0/1... [15:50:54] (CR) Alexandros Kosiaris: [C: +1] TLS envoyproxy support for eventgate chart [deployment-charts] - https://gerrit.wikimedia.org/r/551253 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:52:08] (CR) BBlack: "Replaced all the string ones with the more-generic [^; ]." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/551840 (owner: BBlack) [15:52:13] (PS1) Jbond: puppet-export-facts: cache the cacert value [puppet] - https://gerrit.wikimedia.org/r/552079 [15:52:37] (PS1) BBlack: [WIP] Parallelize authdns-update with clush [puppet] - https://gerrit.wikimedia.org/r/552081 [15:52:57] addshore: icinga is screaming at me for the edit rate [15:53:08] aand it's back [15:53:12] Amir1: yes, i just altered the alert threshold [15:53:30] it was 120, changed it down to 60 [15:53:46] perhaps it should be less even, or removed as an alert? [15:54:39] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2019.codfw.wmnet'] ` and were **ALL** successful.
[15:55:05] (PS1) Mforns: analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) [15:55:10] (CR) Filippo Giunchedi: [C: +1] "Neat!" [puppet] - https://gerrit.wikimedia.org/r/552079 (owner: Jbond) [15:55:24] (PS2) BBlack: [WIP] Parallelize authdns-update with clush [puppet] - https://gerrit.wikimedia.org/r/552081 [15:55:41] (CR) Alexandros Kosiaris: [C: -1] Kafka producer TLS support for eventgate charts (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:56:06] (CR) Volans: [C: +1] "LGTM, thanks for the fix" [puppet] - https://gerrit.wikimedia.org/r/552079 (owner: Jbond) [15:56:16] (CR) Jbond: [C: +2] puppet-export-facts: cache the cacert value [puppet] - https://gerrit.wikimedia.org/r/552079 (owner: Jbond) [15:56:18] (CR) Alexandros Kosiaris: [C: +1] Add discovery entries for eventgate-logging-external [dns] - https://gerrit.wikimedia.org/r/550915 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:56:45] (CR) jerkins-bot: [V: -1] analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) (owner: Mforns) [15:57:28] (PS2) Mforns: analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) [15:57:58] Operations, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (akosiaris) >>! In T236386#5672742, @Ottomata wrote: > @Joe @akosiaris @ema I'd like to move forward with these patches this...
[15:59:55] Operations, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) Thank you! I like your suggestions on the kafka producer TLS one, will implement. Joe can help with the rest tod... [16:00:17] (CR) jerkins-bot: [V: -1] analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) (owner: Mforns) [16:00:36] (PS5) Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [16:01:59] (PS8) Jcrespo: bacula: Add prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) [16:02:01] (PS7) Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [16:02:03] (PS6) Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [16:02:32] (CR) jerkins-bot: [V: -1] Add image submission mode to debmonitor client [software/debmonitor] - https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff) [16:02:56] (CR) Jcrespo: "First comment done."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [16:03:57] (PS3) Mforns: analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) [16:03:58] !log depool wdqs1004 to allow catching up on lag - T238229 [16:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229 [16:07:12] Operations, Traffic, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (ema) On cp1075: ` $ sudo stap -ve 'probe process("/usr/bin/traffic_server").statement("retry_server_connection_not_open@./proxy/http/HttpTransact.cc:3612") { printf("%d retry %d max_retries resp... [16:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3314 after compression', diff saved to https://phabricator.wikimedia.org/P9695 and previous config saved to /var/cache/conftool/dbconfig/20191120-160813-marostegui.json [16:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] (CR) Elukey: [C: +2] analytics::refinery::job::druid_load: fix neflow sanitization [puppet] - https://gerrit.wikimedia.org/r/552082 (https://phabricator.wikimedia.org/T229674) (owner: Mforns) [16:14:11] !log pool cp2019 with ATS backend T227432 [16:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:16] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:16:03] (CR) Elukey: ATS/varnish: rename thorium director to analytics-web (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551939 (owner: Dzahn) [16:16:40] (CR) Elukey: ATS/varnish: rename thorium director to analytics-web (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/551939 (owner: Dzahn) [16:19:25] (PS1) Ema: cache: reimage cp2023 as text_ats [puppet] - https://gerrit.wikimedia.org/r/552083 (https://phabricator.wikimedia.org/T227432) [16:23:25] !log depool cp2023 and reimage as text_ats T227432 [16:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:29] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:24:01] (CR) Ema: [C: +2] cache: reimage cp2023 as text_ats [puppet] - https://gerrit.wikimedia.org/r/552083 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [16:25:29] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var/log/wm... [16:27:21] (PS2) CRusnov: netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) [16:28:02] (CR) Muehlenhoff: "Are these classes really still relevant? It pulls in google perf tools and a selection of some debug packages which were originally releva" [puppet] - https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: Effie Mouzeli) [16:28:04] (CR) CRusnov: "PS2 actually git adds the files that were neglected, which is the actual meat of the change."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:28:48] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2023_v4,cp2023_v6} Ema reimaging cp2023 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:30:17] (PS4) DCausse: Support /entity/ and other Wikidata URLs for Commons [puppet] - https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: Smalyshev) [16:30:19] (PS3) DCausse: [wikidata] provide better link to statement information [puppet] - https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) [16:30:21] (CR) jerkins-bot: [V: -1] netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:37:36] (PS3) BBlack: Parallelize authdns-update with clush [puppet] - https://gerrit.wikimedia.org/r/552081 (https://phabricator.wikimedia.org/T98006) [16:37:38] (PS1) BBlack: authdns-local-update: non-verbose by default [puppet] - https://gerrit.wikimedia.org/r/552085 (https://phabricator.wikimedia.org/T98006) [16:38:41] Operations, Wikimedia-Mailing-lists: Create OpenGLAM mailing list - https://phabricator.wikimedia.org/T238759 (SandraF_WMF) [16:40:27] Operations, Wikimedia-Mailing-lists: Create OpenGLAM mailing list - https://phabricator.wikimedia.org/T238759 (SandraF_WMF) [16:40:52] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:00] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:43:03] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:12] (CR) Lucas Werkmeister (WMDE): [wikidata] provide better link to statement information (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: DCausse) [16:43:56] (CR) Filippo Giunchedi: [C: +1] prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [16:45:41] (PS3) CRusnov: netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) [16:48:35] (CR) jerkins-bot: [V: -1] netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:49:11] jouncebot: next [16:49:12] In 1 hour(s) and 10 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T1800) [16:49:15] jouncebot: now [16:49:16] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [16:49:59] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ema) This just happened on cp2023 too. [16:50:25] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var...
[16:51:15] (CR) Filippo Giunchedi: prometheus-bacula-exporter: Setup bacula collection on prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [16:53:14] (PS4) Cmjohnson: Adding mac address for ms-be1058-59 to dhcp file [puppet] - https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) [16:55:12] !log installing rpcbind bugfix updates from buster 10.2 point release [16:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:35] (CR) Cmjohnson: [C: +2] Adding mac address for ms-be1058-59 to dhcp file [puppet] - https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) (owner: Cmjohnson) [16:55:42] (CR) Volans: "Sorry in a meeting and with another to do, is it still open to discuss options here?" [puppet] - https://gerrit.wikimedia.org/r/552081 (https://phabricator.wikimedia.org/T98006) (owner: BBlack) [16:56:18] Operations, Elasticsearch, Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (TJones) Open→Resolved [16:56:51] Operations, SDC General, Structured Data Engineering, Structured-Data-Backlog, and 2 others: Refactor Puppet WDQS module to make it usable for wdqs and cqs - https://phabricator.wikimedia.org/T232297 (TJones) Open→Resolved [16:57:03] Operations, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (TJones) Open→Resolved [16:57:23] Operations, DNS, SRE-tools, Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (crusnov) p:Triage→Normal [16:57:59] Operations, Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (crusnov) p:Triage→Normal [16:58:22] !log disabling
puppet on cloudvirt1003 and 1004 for T210715 [16:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:27] T210715: cloudvps: PDNS 3.x vs 4.x - https://phabricator.wikimedia.org/T210715 [16:58:43] Operations, Gerrit-Privilege-Requests, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (crusnov) p:T... [16:59:06] (PS5) Andrew Bogott: cloudservices: move from pdns3 to pdns4 [puppet] - https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) [17:00:48] (PS1) Cmjohnson: Adding dhcpd file for ms-be1057 [puppet] - https://gerrit.wikimedia.org/r/552090 (https://phabricator.wikimedia.org/T237438) [17:02:26] (CR) Andrew Bogott: [C: +2] cloudservices: move from pdns3 to pdns4 [puppet] - https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott) [17:03:25] !log upgrading pdns to version 4 on cloudvirt1004 T210715 [17:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:32] T210715: cloudvps: PDNS 3.x vs 4.x - https://phabricator.wikimedia.org/T210715 [17:04:52] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [17:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:10] Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff) [17:06:54] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] (PS1) Andrew Bogott: pdns: Fix typo re: pdns_api_key [puppet] - https://gerrit.wikimedia.org/r/552093 (https://phabricator.wikimedia.org/T210715) [17:08:21] Operations, ops-eqiad, fundraising-tech-ops:
rack/setup/install frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (RobH) Please note that this has two entries in netbox: https://netbox.wikimedia.org/dcim/devices/2267/ https://netbox.wikimedia.org/dcim/devices/2318/ (this entry has b... [17:08:45] (PS7) Effie Mouzeli: mediawiki: remove all hhvm related files and hieradata [puppet] - https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [17:10:02] (PS7) Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [17:11:52] (CR) Andrew Bogott: [C: +2] pdns: Fix typo re: pdns_api_key [puppet] - https://gerrit.wikimedia.org/r/552093 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott) [17:15:06] Operations, Gerrit-Privilege-Requests, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (hashar) Open... [17:16:01] Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (BBlack) There were two TLS-level changes to the certificate output for esams specifically, each of which bumped the output size (t...
[17:17:10] (PS1) Mobrovac: Citoid: Update image to 2019-11-20-144606-production [deployment-charts] - https://gerrit.wikimedia.org/r/552094 (https://phabricator.wikimedia.org/T238083) [17:17:46] (CR) Mobrovac: [C: +2] Citoid: Update image to 2019-11-20-144606-production [deployment-charts] - https://gerrit.wikimedia.org/r/552094 (https://phabricator.wikimedia.org/T238083) (owner: Mobrovac) [17:17:58] (Merged) jenkins-bot: Citoid: Update image to 2019-11-20-144606-production [deployment-charts] - https://gerrit.wikimedia.org/r/552094 (https://phabricator.wikimedia.org/T238083) (owner: Mobrovac) [17:18:57] !log upgrading pdns to version 4 on cloudservices1003 [17:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:21] !log mobrovac@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [17:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:36] (CR) Jcrespo: "So I am a bit concerned about scraping every minute, because it is quite CPU (python string handling is relatively inefficient, which so m" [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [17:20:53] (PS1) Muehlenhoff: Align mw partman recipes [puppet] - https://gerrit.wikimedia.org/r/552096 [17:20:59] !log mobrovac@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' . [17:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:54] PROBLEM - Check the last execution of sync_check_icinga_contacts on icinga1001 is CRITICAL: CRITICAL: Status of the systemd unit sync_check_icinga_contacts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:22:04] PROBLEM - Check systemd state on icinga1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:20] (PS8) Jcrespo: prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) [17:24:32] !log mobrovac@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' . [17:24:36] (PS8) Jcrespo: prometheus-bacula-exporter: Setup bacula collection on prometheus [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) [17:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:16] (CR) Alexandros Kosiaris: [C: +1] prometheus-bacula-exporter: Setup service on bacula director [puppet] - https://gerrit.wikimedia.org/r/552027 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [17:25:51] Operations, Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` and were **ALL** successful. [17:40:40] (CR) MarcoAurelio: [C: +1] "Code looks technically correct.
Would you be willing to enable this also on Meta-Wiki so local CUs can also test it/provide feedback at th" [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: Tchanders) [17:40:43] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/552033 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [17:40:44] !log pool cp2023 with ATS backend T227432 [17:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:50] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [17:41:05] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema) [17:46:14] (PS1) Andrew Bogott: Designate: allow projectadmins to create zones. [puppet] - https://gerrit.wikimedia.org/r/552104 [17:46:16] (CR) Andrew Bogott: [C: +2] Openstack: Add config files for version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/550502 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott) [17:47:11] (PS2) Andrew Bogott: Designate: allow projectadmins to create zones. [puppet] - https://gerrit.wikimedia.org/r/552104 [17:51:08] Operations, DNS, SRE-tools, Traffic: Include zone+subnet checks for DNS validation - https://phabricator.wikimedia.org/T238727 (Volans) @fgiunchedi I think is fair request, but given we're in process of auto-generating all mgmt and then server's DNS records this might have less benefit that in th... [17:52:40] (Abandoned) Andrew Bogott: Designate: allow projectadmins to create zones. [puppet] - https://gerrit.wikimedia.org/r/552104 (owner: Andrew Bogott) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:29] yes there is a gerrit patch [18:00:33] it's mine [18:00:35] i'll SWAT it [18:01:06] (PS3) Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [18:01:43] (CR) Ottomata: Kafka producer TLS support for eventgate charts (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [18:02:09] it’s annoying that jouncebot doesn’t auto-refresh right before an announcement [18:03:15] (PS4) Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [18:03:21] jouncebot: refresh [18:03:22] I refreshed my knowledge about deployments. [18:03:39] Lucas_WMDE: yup, maybe a feat. req.
for the future [18:05:36] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.1 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:08:25] (PS1) Herron: prometheus: add role::lists class to mtail exim metrics [puppet] - https://gerrit.wikimedia.org/r/552105 [18:08:58] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:09:33] ^ thats a new check that notifies if errors in the logstash logs themselves spike [18:13:29] !log restart mtail on fermium [18:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:38] (Abandoned) Herron: prometheus: add role::lists class to mtail exim metrics [puppet] - https://gerrit.wikimedia.org/r/552105 (owner: Herron) [18:17:25] !log mobrovac@deploy1001 Synchronized php-1.35.0-wmf.5/includes/libs/virtualrest/ParsoidVirtualRESTService.php: Parsoid VRS: Add the Host header - T229015 T229078 T229074 (duration: 00m 52s) [18:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:32] T229078: Preparing Flow for Parsoid-PHP switch - https://phabricator.wikimedia.org/T229078 [18:17:32] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [18:17:32] T229074: Preparing VisualEditor for Parsoid-PHP switch - https://phabricator.wikimedia.org/T229074 [18:17:44] ok i'm done with swat [18:17:55] !log morning SWAT done [18:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:55] (PS1) MarcoAurelio: Add .gitreview [software/netbox-extras] - https://gerrit.wikimedia.org/r/552106 [18:21:34] (PS1) Phamhi: labmon: add
compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [18:22:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10ahemmer) @nuria Approved as Cherraye's manager. [18:23:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Nuria) Approved on my end. [18:24:14] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] "New repository." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/552106 (owner: 10MarcoAurelio) [18:30:56] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) The following changes have been made to be compatible with Buster: - Require package change... 
[18:31:28] !log ganeti - introducing and installing buster on new VMs xhgui1001/xhgui2001 - for replacing tungsten (jessie) T238098
[18:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:34] T238098: vm request for xhgui - https://phabricator.wikimedia.org/T238098
[18:31:53] (CR) Effie Mouzeli: [C: +1] Align mw partman recipes [puppet] - https://gerrit.wikimedia.org/r/552096 (owner: Muehlenhoff)
[18:32:43] (CR) Dzahn: [C: +1] Align mw partman recipes [puppet] - https://gerrit.wikimedia.org/r/552096 (owner: Muehlenhoff)
[18:33:01] (CR) Dzahn: [C: +2] Align mw partman recipes [puppet] - https://gerrit.wikimedia.org/r/552096 (owner: Muehlenhoff)
[18:33:40] (PS1) Andrew Bogott: pdns monitoring: try to resolve target_fqdn rather than target_host [puppet] - https://gerrit.wikimedia.org/r/552109 (https://phabricator.wikimedia.org/T210715)
[18:36:24] (CR) Dzahn: [C: +1] pdns monitoring: try to resolve target_fqdn rather than target_host [puppet] - https://gerrit.wikimedia.org/r/552109 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott)
[18:36:58] (CR) Andrew Bogott: [C: +2] pdns monitoring: try to resolve target_fqdn rather than target_host [puppet] - https://gerrit.wikimedia.org/r/552109 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott)
[18:37:56] jouncebot now
[18:37:57] For the next 0 hour(s) and 22 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T1800)
[18:39:56] (PS1) Addshore: wgWikidataOrgQueryServiceMaxLagFactor 170 [mediawiki-config] - https://gerrit.wikimedia.org/r/552110
[18:40:01] heh, i edited the reviewer-bot page the other day and one missing \ and now i got just a few more review requests than expected :p
[18:40:32] (CR) Addshore: [C: +2] wgWikidataOrgQueryServiceMaxLagFactor 170 [mediawiki-config] - https://gerrit.wikimedia.org/r/552110 (owner: Addshore)
[18:41:18] (Merged) jenkins-bot: wgWikidataOrgQueryServiceMaxLagFactor 170 [mediawiki-config] - https://gerrit.wikimedia.org/r/552110 (owner: Addshore)
[18:42:03] (PS1) CDanis: Varnish-repool blubberoid in eqiad [puppet] - https://gerrit.wikimedia.org/r/552111
[18:42:40] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (jijiki) Version 1.10.0-1~wmf1 has been deployed to `deployment-mediawiki-09` and `deployment-mediawiki-07`. Please let me know if it w...
[18:42:56] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 170 (duration: 00m 53s)
[18:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:01] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774
[18:43:27] (PS2) CDanis: Varnish-repool blubberoid in eqiad [puppet] - https://gerrit.wikimedia.org/r/552111
[18:44:15] Operations, vm-requests, Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (Dzahn) Open→Resolved VMs have been created and OS is installed. Added to site.pp with the spare::system role.
[18:44:17] Operations, vm-requests: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools - https://phabricator.wikimedia.org/T194390 (Dzahn)
[18:45:57] Operations, Performance-Team, observability, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (Dzahn)
[18:46:15] Operations, vm-requests, Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (Dzahn)
[18:46:18] Operations, Performance-Team, observability, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (Dzahn)
[18:46:25] (CR) CDanis: [C: +2] Varnish-repool blubberoid in eqiad [puppet] - https://gerrit.wikimedia.org/r/552111 (owner: CDanis)
[18:46:33] Operations, vm-requests, Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (Dzahn)
[18:47:49] hi ops, I'm an administrator of the Analytics mailing list and I got about 200 bounce action notification emails from mailman in the last 15 minutes, is that expected? who should I talk with, thanks!
[18:48:03] (PS1) Addshore: wgWikidataOrgQueryServiceMaxLagFactor to 120 [mediawiki-config] - https://gerrit.wikimedia.org/r/552112
[18:48:05] mutante ?
[18:48:18] mforns: no, that's raro
[18:48:21] weird*
[18:48:25] (CR) Addshore: [C: +2] wgWikidataOrgQueryServiceMaxLagFactor to 120 [mediawiki-config] - https://gerrit.wikimedia.org/r/552112 (owner: Addshore)
[18:48:35] hm
[18:48:46] (CR) Dzahn: Adding dhcpd file for ms-be1057 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552090 (https://phabricator.wikimedia.org/T237438) (owner: Cmjohnson)
[18:49:58] (Merged) jenkins-bot: wgWikidataOrgQueryServiceMaxLagFactor to 120 [mediawiki-config] - https://gerrit.wikimedia.org/r/552112 (owner: Addshore)
[18:50:54] mforns: I'd file a task and ping the clinic duty person
[18:50:58] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: Ammarpad)
[18:51:10] hauskater, ok, thanks!
[18:51:14] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 54s)
[18:51:18] you're welcome :)
[18:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:19] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774
[18:56:43] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 50s)
[18:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:48] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774
[18:56:54] Amir1: ^^ odd, again, the first sync didn't do it, but a resync did
[18:58:47] addshore: I would say we definitely should make a phabricator ticket against scap
[18:59:10] (PS1) CDanis: test change for PCC only do not submit [puppet] - https://gerrit.wikimedia.org/r/552114
[18:59:20] I'll write it now Amir1 :)
[19:01:52] Operations: [Mailing lists] Received 205 bounce action notification emails from mailman in 20 minutes - https://phabricator.wikimedia.org/T238780 (mforns)
[19:03:42] (PS1) Mholloway: Update wikifeeds to 2019-11-20-152441-production [deployment-charts] - https://gerrit.wikimedia.org/r/552115
[19:04:35] Operations, Mail, Wikimedia-Mailing-lists: [Mailing lists] Received 205 bounce action notification emails from mailman in 20 minutes - https://phabricator.wikimedia.org/T238780 (MarcoAurelio)
[19:04:49] !log xhgui1001 - initial puppet run, signed puppet cert on puppetmaster1001
[19:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:30] Operations, serviceops: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (ayounsi) p:Triage→Normal
[19:10:33] Operations, serviceops: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (Dzahn) a:Dzahn
[19:18:04] Operations, serviceops: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (Dzahn)
[19:18:09] Operations, Phabricator, Traffic, serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn)
[19:19:54] Operations, serviceops: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (Dzahn) Port 22280/tcp is the aphlict service which is currently disabled (T238593). In addition to the backend It has also been disabled in ATS (https://gerrit.wikimedia.org/r/c/operations/puppet/+...
[19:28:20] (PS1) Dzahn: varnish: remove config for disabled phab_aphlict [puppet] - https://gerrit.wikimedia.org/r/552122 (https://phabricator.wikimedia.org/T238781)
[19:32:42] (PS2) Dzahn: varnish: remove config for disabled phab_aphlict [puppet] - https://gerrit.wikimedia.org/r/552122 (https://phabricator.wikimedia.org/T238781)
[19:33:24] (CR) Dzahn: "also needs this https://gerrit.wikimedia.org/r/c/operations/puppet/+/552122 ? because of https://phabricator.wikimedia.org/T238781 ?" [puppet] - https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) (owner: Vgutierrez)
[19:35:29] (PS4) Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723)
[19:36:45] (CR) Cmjohnson: [C: +2] Adding dhcpd file for ms-be1057 [puppet] - https://gerrit.wikimedia.org/r/552090 (https://phabricator.wikimedia.org/T237438) (owner: Cmjohnson)
[19:37:01] Operations, serviceops, Patch-For-Review: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (Dzahn) @ema @Vgutierrez So this was removed from ATS in https://gerrit.wikimedia.org/r/c/operations/puppet/+/551731 but does it also need https://gerrit.wikimedia.org/r/c/oper...
[19:37:06] Operations, Analytics, Event-Platform, Wikimedia-Logstash, observability: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (Ottomata) a:Ottomata
[19:41:42] (CR) Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723) (owner: Ammarpad)
[19:42:10] (PS5) Ammarpad: Restrict editing CNBanner namespace to autoconfirmed on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/552024 (https://phabricator.wikimedia.org/T238723)
[19:45:47] Operations, Analytics, Wikimedia-Logstash, observability, and 3 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (Ottomata) Open→Declined We'll be moving EventStreams to k8s next quarter, which will take advantage of new logging pipelin...
[19:45:51] Operations, Wikimedia-Logstash, observability, service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (Ottomata)
[19:46:41] (CR) Niharika29: "Marco, I think it is a bit too early to enable it on meta to be used with real data. We might accidentally cause a security issue as all o" [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: Tchanders)
[19:50:04] Operations, Analytics, serviceops-radar, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Ottomata) Stalled→Resolved a:Ottomata This is now supported via Kafka, Swift and an Oozie workflow. {T...
[19:51:27] Operations, Analytics, Discovery, Article-Recommendation, Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata) Open→Resolved a:Ottomata This was finished back in July....
[19:51:30] Operations, Analytics, serviceops-radar, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Ottomata)
[19:57:52] (PS1) Dzahn: add xhgui::app role on xhgui VMs [puppet] - https://gerrit.wikimedia.org/r/552124
[19:57:56] (CR) MarcoAurelio: [C: +1] "Thanks Niharika. What you say makes sense. We'll wait until the tool is stable :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: Tchanders)
[20:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T2000).
[20:05:03] parsoid deploy coming in a little bit.
[20:06:44] (PS1) Effie Mouzeli: dumps: fix hieradata for generation::worker::dumper_misc_crons [puppet] - https://gerrit.wikimedia.org/r/552125
[20:06:49] Operations, ops-codfw, decommission: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Papaul)
[20:13:09] (CR) Mholloway: [C: +2] Update wikifeeds to 2019-11-20-152441-production [deployment-charts] - https://gerrit.wikimedia.org/r/552115 (owner: Mholloway)
[20:13:21] (Merged) jenkins-bot: Update wikifeeds to 2019-11-20-152441-production [deployment-charts] - https://gerrit.wikimedia.org/r/552115 (owner: Mholloway)
[20:13:36] Operations, DC-Ops, decommission, fundraising-tech-ops: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (Papaul)
[20:15:15] (CR) Effie Mouzeli: [C: -1] "That is not good https://puppet-compiler.wmflabs.org/compiler1001/19517/snapshot1008.eqiad.wmnet/prod.snapshot1008.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/552125 (owner: Effie Mouzeli)
[20:16:33] Operations, Analytics, Traffic, User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927 (Ottomata) Open→Declined Old task, I think we aren't likely to do this. Declining, feel free to reopen i...
[20:19:11] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[20:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:25] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:21:35] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops
[20:21:43] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:21:51] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:21:51] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:22:01] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:22:07] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:33] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:23:15] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops
[20:23:23] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:23:31] !log notebook1003 - sudo systemctl nagios-nrpe-server (as usual ....)
[20:23:33] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:23:33] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:43] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:23:49] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:13] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:24:45] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:25:06] deploying parsoid now.
[20:25:39] (PS1) Dzahn: make xhgui::app role support buster [puppet] - https://gerrit.wikimedia.org/r/552126 (https://phabricator.wikimedia.org/T238788)
[20:27:03] Operations, Traffic, Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (Krinkle) Resolved→Open Re-opening but not 100% sure it was this change that caused the issue. The issue - When `X-Wikimedia-Debug` is enabled (e.g. via the Wikime...
[20:27:05] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (Krinkle)
[20:27:07] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[20:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:20] Operations, serviceops: dropped packets to echostore.svc.eqiad 8082/tcp - https://phabricator.wikimedia.org/T238789 (ayounsi) p:Triage→Normal
[20:27:56] !log ssastry@deploy1001 Started deploy [parsoid/deploy@d5646b7]: Updating Parsoid to 2e79460d
[20:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:04] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[20:31:51] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[20:31:54] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[20:34:40] Operations, observability, serviceops: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (ayounsi)
[20:37:10] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@d5646b7]: Updating Parsoid to 2e79460d (duration: 09m 14s)
[20:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:09] Operations, observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (ayounsi) p:Triage→Normal
[20:46:15] Operations, observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (ayounsi) In addition prometheus2003/4.codfw.wmnet are also trying to reach 9700/tcp on kafkamon2001 only.
[20:50:28] Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) Looks like it is still happening: https://log...
[20:51:52] doing another dummy parsoid deploy to test a parsoid scap deployment config fix.
[20:53:50] !log ssastry@deploy1001 Started deploy [parsoid/deploy@7665624]: Dummy Parsoid deploy to test T238748 fix
[20:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:55] T238748: Class not found transient errors after Parsoid/PHP scap3 deploys - https://phabricator.wikimedia.org/T238748
[20:55:23] Operations, Phabricator, Traffic, serviceops, Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) >>! In T238593#5678148, @Dzahn wrote: > As Mukunda pointed out the aphlict service does not even...
[20:55:27] (CR) Dzahn: [wikidata] provide better link to statement information (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: DCausse)
[20:56:22] Operations, vm-requests, Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (Dzahn)
[21:00:04] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Creating Several Wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T2100).
[21:00:26] Amir1: Hi. Do you think we can start with ge.wikimedia first? It's the oldest.
[21:00:39] Moon is better, I would go with trying to take the moon, that's easier
[21:00:57] hauskater: sure
[21:01:07] Amir1: let you know if you need anything
[21:01:10] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@7665624]: Dummy Parsoid deploy to test T238748 fix (duration: 07m 20s)
[21:01:15] I don't have the list here, let's wait for Urbanecm for a little bit
[21:01:19] oh you're here
[21:01:22] yup
[21:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:24] T238748: Class not found transient errors after Parsoid/PHP scap3 deploys - https://phabricator.wikimedia.org/T238748
[21:01:29] Amir1: thanks, like Martin, please let me know if I can help with anything
[21:01:35] although I can't do much at this point
[21:01:36] oh! multi wiki creation! nice
[21:02:00] Amir1: the user group is T236389
[21:02:00] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389
[21:02:01] Urbanecm: first thing, can I have the list of wikis on order of importance? which one first
[21:02:06] Yeah, we're in the maternity wing of the Hospital mutante
[21:02:26] Amir1: the one I linked, rest doesn't matter IMO
[21:02:26] look at those baby wikis
[21:03:18] okay
[21:03:29] Urbanecm: do you want to do any of them yourself?
[21:03:49] Amir1: I'd prefer you do all of them
[21:04:36] Are you sure? The buses in Berlin are horrible. We need a backup person
[21:05:22] especially if you're going to be away :P I'm not sure if I can fix any mess it can cause :)
[21:06:13] (CR) DCausse: [wikidata] provide better link to statement information (3 comments) [puppet] - https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: DCausse)
[21:06:21] I'm around. You do the first wiki, I do the rest
[21:06:42] Okay.
[21:07:22] Amir1: first thing I have to do is to merge the config and pull onto mwmaint, right?
[21:07:31] (PS4) DCausse: [wikidata] provide better link to statement information [puppet] - https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397)
[21:08:09] Urbanecm: yup
[21:08:46] (PS17) Urbanecm: Initial configuration for ge.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[21:08:47] doing
[21:08:51] (CR) Urbanecm: [C: +2] Initial configuration for ge.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[21:09:41] (Merged) jenkins-bot: Initial configuration for ge.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[21:11:25] I made am.wikimedia.org before, it's not that hard, the interwiki handling is a little bit annoying
[21:11:41] Amir1: next step is `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki ka wikimedia gewikimedia ge.wikimedia.org`?
[21:12:41] if you pulled it yes
[21:12:57] yes, pulled
[21:13:02] doing
[21:14:37] Amir1: that was faster than I expected. Now, I should sync dblists, yes?
[21:14:59] yes
[21:15:03] make sure they are there
[21:15:14] okay
[21:16:52] !log urbanecm@deploy1001 Synchronized dblists: new wiki gewikimedia (T236389) (duration: 00m 52s)
[21:16:55] "ka wikimedia gewikimedia ge.wikimedia.org"? either database name or subdomain, right?
[21:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:58] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389
[21:17:02] isnt that at least a space too many?
[21:17:16] ka is the language, ge the country code
[21:17:20] it's a chapter wiki
[21:17:29] as hauskater says
[21:17:47] "gewikimedia" is the db name
[21:17:53] aye
[21:17:55] ok
[21:18:01] Amir1: next would be scap sync-wikiversions, right?
[21:18:24] (the initial config patch already has touched wikiversions)
[21:18:29] since it's already here sure
[21:18:29] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/545909/17/wikiversions.json
[21:18:55] ok, doing
[21:19:27] after this, you can pull it in mwdebug and see if it's working or not
[21:19:33] before going forward
[21:19:46] I added the wikiversions because I knew it was happening today. Shouldn't have I?
[21:19:56] Operations, observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (ayounsi) p:Triage→Lowest
[21:20:01] hauskater: it's fine, I would have to do that myself otherwise :-)
[21:20:12] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: new wiki gewikimedia (T236389)
[21:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:25] I usually don't do it because it's what causes most merge conflicts later
[21:20:27] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[21:20:32] manual rebase == pita
[21:20:35] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops
[21:20:45] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:20:55] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:20:57] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[21:21:04] hauskater: yeah, I do it separately but it doesn't matter when the patch is changed close to deployment
[21:21:07] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:21:09] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:37] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[21:22:07] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[21:22:17] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops
[21:22:27] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:22:35] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:22:39] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[21:22:41] Amir1: fails with `Query: SELECT rt_revision FROM `revtag` WHERE rt_page = '1' AND rt_type = 'tp:mark' ORDER BY rt_revision DESC LIMIT 1
[21:22:41] Function: TranslatablePage::getTag Error: 1146 Table 'gewikimedia.revtag' doesn't exist (10.64.0.205)`
[21:22:49] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:22:53] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:59] Operations, Traffic, Inuka-Team (Kanban), Patch-For-Review, Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (Neil_P._Quinn_WMF) @SBisson I looked over the patch and [the schema](https://meta.wikimedia.org/wiki/Schema:InukaPageView)...
[21:23:07] okay
[21:23:09] looking
[21:23:09] !log notebook1003 - systemctl start nagios-nrpe-server (second time today already today T212824)
[21:23:11] given it has enabled Translate, it's because I didn't create tables for Translate?
[21:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:14] Amir1: ^^
[21:23:14] T212824: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824
[21:23:15] my guess
[21:23:19] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[21:23:24] yes
[21:23:37] creating it is not hard
[21:23:39] Urbanecm: maybe the Translate tables are missing
[21:23:47] it is
[21:23:54] Amir1: `mwscript extensions/WikimediaMaintenance/createExtensionTables.php gewikimedia Translate`, right?
[21:23:56] createExtensionTables.php translate [21:24:04] yup [21:24:16] Urbanecm: also translation notifications, if they have them [21:24:29] that's a separate extension? [21:24:46] Amir1: wow, it seems to work! [21:24:49] (03PS1) 10Krinkle: webperf: Remove xhgui profile from webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/552135 (https://phabricator.wikimedia.org/T180761) [21:25:03] hauskater: I don't think so, that's for translate repo [21:25:07] Urbanecm yes, https://www.mediawiki.org/wiki/Extension:TranslationNotifications [21:25:11] (e.g. translatewiki.net) [21:25:17] Urbanecm: Yep, TranslationNotifications is a different extension [21:25:29] Amir1: hauskater: DannyS712: well, it has wmgUseTranslationNotifications [21:25:33] set to true [21:25:39] but it's not configured for createExtensionTables. Maybe it does not need any [21:25:52] it doesn't have any tables AFAIS [21:25:54] it's marked as "database changes: no" at mediawiki [21:26:01] The wiki is live now :) [21:26:03] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/TranslationNotifications/+/master [21:26:11] yup, seems to work :) [21:26:40] Awesome [21:26:42] Thanks [21:26:42] now, the rest [21:26:51] scap sync-file multiversion/MWMultiVersion.php and so on, I guess [21:27:10] yup [21:27:40] doing [21:27:57] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:05] !log urbanecm@deploy1001 Synchronized multiversion/MWMultiVersion.php: new wiki gewikimedia (T236389) (duration: 00m 52s) [21:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:11] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389 [21:28:21] doing logos now [21:28:23] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666:
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:28:55] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:29:05] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:29:15] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:29:23] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:29:26] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: new wiki gewikimedia (T236389) (duration: 00m 53s) [21:29:27] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:34] IS.php is going now [21:29:37] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:29:49] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:30:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: new wiki gewikimedia (T236389) (duration: 00m 52s) 
[21:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:07] Amir1: the only other thing I see at https://wikitech.wikimedia.org/wiki/Add_a_wiki#MediaWiki_configuration is interwiki cache, which can wait until the rest is created. [21:31:24] yes, also it should be added in meta [21:31:50] https://meta.wikimedia.org/wiki/Interwiki_map [21:31:58] ok [21:32:01] if you get this done while I'm doing the rest, it would be amazing [21:33:13] Urbanecm should we try to reproduce https://phabricator.wikimedia.org/T235885 with one of the new wikis? I think steps to reproduce are: before I autocreate a local account, import upload something that has an edit I made, and assign edits to local users [21:33:39] Amir1: done with https://meta.wikimedia.org/w/index.php?title=Interwiki_map&diff=19574686&oldid=19000320 [21:33:58] (03PS2) 10Dzahn: xhgui::app: add support for buster/PHP7.3 [puppet] - 10https://gerrit.wikimedia.org/r/552126 (https://phabricator.wikimedia.org/T238788) [21:34:27] DannyS712: if you can verify your theory, that would be great [21:34:59] I have no way to verify it - someone //on the wiki// with import rights is needed [21:35:00] Urbanecm: this is a fishbowl wiki [21:35:08] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) Hi there. Wiki database just created. Regards. 
[21:35:17] they are not connected to SUL anyway [21:35:24] Let me create another one [21:35:44] ah, Amir1's right, let's wait for the other wikis then [21:35:52] !log repool wdqs1004 - T238229 [21:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:57] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229 [21:36:01] DannyS712: I'd wait for regular import of content - if that's not too much waiting for you :-) [21:36:10] * Urbanecm is going to create an account for the requestor [21:36:26] ge.wikimedia - https://www.wikidata.org/wiki/Q75847675 [21:36:32] No problem - just want to make sure that its done the way I think caused a problem last time, to reproduce [21:36:39] "I can has accountz" [21:36:40] The next wiki: T236861 [21:36:41] T236861: Create Minangkabau Wiktionary - https://phabricator.wikimedia.org/T236861 [21:36:52] Just let me know when its safe to try and visit the wiki [21:37:12] DannyS712: ge.wikimedia is up and running, you can visit it; but it's a members-only wiki [21:37:23] we won't be able to create an account there [21:37:33] hauskater: they meant to reproduce the issue with the other wikis [21:37:36] (03PS6) 10Ladsgroup: Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: 10Urbanecm) [21:37:52] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: 10Urbanecm) [21:38:00] !log mwscript createAndPromote.php --wiki=gewikimedia --sysop --bureaucrat Mehman97 (T236389) [21:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:05] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389 [21:38:38] (03Merged) 10jenkins-bot: Initial configuration for 
minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551974 (https://phabricator.wikimedia.org/T236861) (owner: 10Urbanecm) [21:39:09] Urbanecm: you can create the account with the special page for creating accounts, not the maintenance script [21:39:53] Amir1: no, I can't, only sysops have createaccount at fishbowl wikis AFAICS [21:40:18] looks for Wikidata entity "Phabricator ticket id" [21:40:26] I remember doing it for hywikimedia [21:40:34] (or amwikimedia I forgot) [21:40:38] anyway, it's fine [21:41:01] Do we add chapter wikis to Wikistats? [21:41:47] hauskater: yea [21:41:49] http://wikistats.wmflabs.org/display.php?t=wx <-- looks so [21:42:15] hauskater: i added entity "inception" to gewiki with reference URL https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=1845430 [21:43:04] mutante: I guess that's fine, but I'm not a regular Wikidata editor [21:43:39] okay for minwiktionary, the database has been created [21:44:57] !log ladsgroup@deploy1001 Synchronized dblists: T236861 (duration: 00m 52s) [21:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:02] T236861: Create Minangkabau Wiktionary - https://phabricator.wikimedia.org/T236861 [21:45:35] (03PS1) 10Ladsgroup: Add minwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552139 (https://phabricator.wikimedia.org/T236861) [21:45:59] (03CR) 10Ladsgroup: [C: 03+2] Add minwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552139 (https://phabricator.wikimedia.org/T236861) (owner: 10Ladsgroup) [21:46:05] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:46:15] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:46:19] RECOVERY
- DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:46:29] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:46:31] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:40] (03Merged) 10jenkins-bot: Add minwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552139 (https://phabricator.wikimedia.org/T236861) (owner: 10Ladsgroup) [21:46:53] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:46:57] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:47:29] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:47:37] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:49:19] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: T236861 [21:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:38] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (10Urbanecm) >>! In T238522#5670499, @Marostegui wrote: > Let us know when the database is created so we can sanitize it before sending it to t... 
[21:50:45] okay the wiki is up, let's move forward [21:51:27] k [21:52:02] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T236861 (duration: 00m 51s) [21:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:07] T236861: Create Minangkabau Wiktionary - https://phabricator.wikimedia.org/T236861 [21:54:01] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T236861 (duration: 00m 52s) [21:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:09] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: T236861 (duration: 00m 51s) [21:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:11] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:56:25] DannyS712: try min.wiktionary.org for T235885 [21:56:26] T235885: DannyS712 wasn't attached to SUL at banwiki - https://phabricator.wikimedia.org/T235885 [21:56:31] !log ladsgroup@deploy1001 Synchronized langlist: T236861 (duration: 00m 52s) [21:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:46] Amir1 have things been imported? [21:57:04] * gehel is looking at wdqs [21:57:08] not yet, you're right, let's wait [21:57:19] gehel: should we stop?
[21:57:31] Amir1: nope, please continue [21:57:50] Amir1: we've three mins till end of reserved window [21:57:51] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:58:30] It's fine, we can extend it, doing it later is way more work and risky [21:58:36] unless there's another thing going on [21:58:44] Amir1: evening swat in one hour [21:58:51] nope, we have one hour [21:58:55] let's finish it [21:58:58] Okay [21:59:05] unless the bus arrives ;) [21:59:13] Urbanecm: regarding szywiki, is it backported? [21:59:30] yup [21:59:34] let's do it then [21:59:42] Amir1: yes, should be [22:00:14] (03PS5) 10Ladsgroup: Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [22:00:33] brb [22:00:34] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [22:00:46] !log Wiki creation continues [22:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:55] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [22:00:59] what? 
[22:01:07] (03Merged) 10jenkins-bot: Initial configuration for szywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [22:01:56] hmm, temp failure it seems [22:03:21] `resubmit` re-enqueues the thing I think [22:03:43] hauskater: it seems to be merged [22:03:48] gate succeeded, main failed [22:03:54] oh, lol [22:04:01] jenkins trolling [22:04:43] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:04:45] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:05:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:05:33] bblack: XioNoX ^ [22:05:55] thx [22:06:27] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:06:27] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:06:48] that's GTT and it's our 3rd backup link [22:07:01] but nothing planned, good thing it was brieg [22:07:04] brief [22:07:13] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:07:45] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:07:53] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops 
[22:08:03] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:08:13] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:08:17] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:08:27] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:08:29] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:57] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:09:19] dbs are done for szywiki [22:10:02] (03PS1) 10CDanis: varnishbe: there are no other varnishbes! 
use ats-be [puppet] - 10https://gerrit.wikimedia.org/r/552142 [22:10:37] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:11:00] !log ladsgroup@deploy1001 Synchronized dblists: T237369 (duration: 00m 52s) [22:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:06] T237369: Create Sakizaya Wikipedia - https://phabricator.wikimedia.org/T237369 [22:11:09] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:11:17] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [22:11:27] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:11:37] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:11:41] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:11:51] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:11:53] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:30] (03PS1) 10Ladsgroup: Add szywiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552145 (https://phabricator.wikimedia.org/T237369) [22:12:58] (03CR) 10Ladsgroup: [C: 03+2] Add szywiki to 
wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552145 (https://phabricator.wikimedia.org/T237369) (owner: 10Ladsgroup) [22:13:10] (03CR) 10Ayounsi: "> Patch Set 2:" (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) (owner: 10CRusnov) [22:13:12] (03CR) 10CDanis: "PCC looks correct: https://puppet-compiler.wmflabs.org/compiler1001/19519/cp1077.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552142 (owner: 10CDanis) [22:13:38] (03Merged) 10jenkins-bot: Add szywiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552145 (https://phabricator.wikimedia.org/T237369) (owner: 10Ladsgroup) [22:15:49] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: T237369 [22:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:44] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Pchelolo) Hehe... @jijiki could you do deployment-parsoid as well please? [22:17:06] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T237369 (duration: 00m 51s) [22:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:10] T237369: Create Sakizaya Wikipedia - https://phabricator.wikimedia.org/T237369 [22:17:57] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10eprodromou) >>! In T236963#5679379, @jijiki wrote: > Version 1.10.0-1~wmf1 has been deployed to `deployment-mediawiki-09` and `deploym... 
[22:19:22] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T237369 (duration: 00m 51s) [22:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:25] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10eprodromou) OK, I jumped the gun. Apparently that's not output from the 1.10.0 version. [22:21:13] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: T237369 (duration: 00m 52s) [22:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:26] !log ladsgroup@deploy1001 Synchronized langlist: T237369 (duration: 00m 53s) [22:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:41] T237369: Create Sakizaya Wikipedia - https://phabricator.wikimedia.org/T237369 [22:22:46] okay we are done with this wiki, let's go with gcr wiki [22:24:33] (03CR) 10Ladsgroup: [C: 03+2] "It doesn't have wikidataclient, I'll add it in another patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:24:49] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:24:49] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10jijiki) >>! In T236963#5680172, @Pchelolo wrote: > Hehe... @jijiki could you do deployment-mediawiki-parsoid-* as well please? all the... 
[22:24:53] (03CR) 10Ladsgroup: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:25:06] (03CR) 10Ladsgroup: [C: 04-1] "real merge conflict + it doesn't have wikidataclient" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:25:16] Urbanecm: If you're still around ^ [22:25:27] .Amir1 looking [22:28:17] (03PS3) 10Urbanecm: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) [22:28:27] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 540 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:28:36] Amir1: here you are [22:28:39] (03PS4) 10Urbanecm: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) [22:29:03] thanks! [22:29:08] yw [22:29:14] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:29:54] 10Operations, 10Cloud-VPS, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (10Krenair) done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/482142 ? 
[22:30:02] (03PS5) 10Urbanecm: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) [22:30:02] okay, /me doesn't know the alphabet [22:31:46] Amir1: jenkins accepted PS5 [22:31:56] yesss [22:32:12] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:32:55] (03Merged) 10jenkins-bot: Initial configuration for gcrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552016 (https://phabricator.wikimedia.org/T238104) (owner: 10Urbanecm) [22:33:52] Urbanecm: regarding shywiktionary, is the Wikimedia Messages needed? Do you think we can move forward with that? [22:34:12] Amir1: AFAICS, WikimediaMessages patch can be deployed/merged at any time [22:34:17] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 540 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:34:20] coool [22:36:29] it's just for the search results [22:36:43] !log ladsgroup@deploy1001 Synchronized dblists: T238104 (duration: 00m 52s) [22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:48] T238104: Create Guianan Creole Wikipedia - https://phabricator.wikimedia.org/T238104 [22:37:03] it'll work, displaying in English in the meanwhile [22:38:29] (03PS1) 10Ladsgroup: Add gcrwiki to wikiversions.json, move gewikimedia to right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552147 (https://phabricator.wikimedia.org/T238104) [22:39:00] (03CR) 10Ladsgroup: [C: 03+2] Add gcrwiki to wikiversions.json, move gewikimedia to right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552147 (https://phabricator.wikimedia.org/T238104) (owner: 10Ladsgroup) [22:39:41] (03Merged) 10jenkins-bot: Add gcrwiki to 
wikiversions.json, move gewikimedia to right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552147 (https://phabricator.wikimedia.org/T238104) (owner: 10Ladsgroup) [22:41:32] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: T238104 [22:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:21] the wiki is up in mwdebug, moving forward [22:43:58] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T238104 (duration: 00m 51s) [22:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:03] T238104: Create Guianan Creole Wikipedia - https://phabricator.wikimedia.org/T238104 [22:45:07] (03CR) 10Muehlenhoff: "The patch looks fine, but note that Buster no longer includes mongodb as it moved to a non-free license (https://bugs.debian.org/cgi-bin/b" [puppet] - 10https://gerrit.wikimedia.org/r/552126 (https://phabricator.wikimedia.org/T238788) (owner: 10Dzahn) [22:46:57] some parts of the wiki creation can be automated, like these syncs [22:47:10] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T238104 (duration: 00m 52s) [22:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:29] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: T238104 (duration: 00m 52s) [22:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:02] * hauskater can hear the baby wikis crying [22:49:14] one more is coming [22:49:20] oh dear Lord :) [22:49:47] !log ladsgroup@deploy1001 Synchronized langlist: T238104 (duration: 00m 51s) [22:49:48] which one is next [22:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:52] T238104: Create Guianan Creole Wikipedia - https://phabricator.wikimedia.org/T238104 [22:50:07] T238105 [22:50:08] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [22:50:25] it's shy, it
won't cry [22:50:48] just a few sniffles eh [22:51:11] (03PS3) 10Ladsgroup: Initial configuration for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) (owner: 10Urbanecm) [22:51:28] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) (owner: 10Urbanecm) [22:52:26] (03Merged) 10jenkins-bot: Initial configuration for shywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552021 (https://phabricator.wikimedia.org/T238105) (owner: 10Urbanecm) [22:53:04] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10RStallman-legalteam) The NDA is complete and on file. Fine to proceed with next steps. Thanks! [22:54:59] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.23 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:56:58] (03PS1) 10CDanis: Varnish-repool cxserver in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/552151 [22:57:34] the database for shywiktionary is up but I'm little bit scared given that the language is 'shy-latn' and I created it with that language code [22:58:23] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:58:23] oh it doesn't matter according to addWiki.php code [22:59:03] (03CR) 10CDanis: [C: 03+2] Varnish-repool cxserver in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/552151 (owner: 10CDanis) [22:59:46] !log ladsgroup@deploy1001 Synchronized 
dblists: T238105 (duration: 00m 53s) [22:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:52] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [22:59:56] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191120T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:18] I'll do the SWAT [23:00:21] I have more patches too [23:00:28] RoanKattouw: Amir1 is still creating a wiki [23:00:31] RoanKattouw: can you wait for ten minutes [23:00:38] OK I'll wait until Amir1 gives me the all clear [23:00:46] things will explode if you sync things right now [23:01:42] RoanKattouw: Thanks.
It has been two hours already [23:02:41] (03PS1) 10Ladsgroup: Add shywiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552152 (https://phabricator.wikimedia.org/T238105) [23:03:15] (03CR) 10Ladsgroup: [C: 03+2] Add shywiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552152 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [23:03:25] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10CCicalese_WMF) [23:03:57] (03Merged) 10jenkins-bot: Add shywiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552152 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [23:04:17] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:53] (03PS4) 10EBernhardson: airflow: Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 [23:04:56] Jhs: o/ [23:05:04] (03PS9) 10EBernhardson: airflow: Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) [23:05:12] (03PS10) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) [23:05:21] Amir1, i've said it before, but it needs repeating: you're awesome :D [23:05:42] nah, we should thank the person who fixed it :D [23:05:50] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: T238105 [23:05:53] I just landed after a 2 hour flight. 
will start importing immediately [23:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:55] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [23:05:57] Amir1, who's that? [23:06:11] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Koavf) I object to deletion: as long as we still own the domain names (that is, "we" being the WMF, not us personally), URIs should stay... [23:06:41] shywiktionary is up in mwdebug, moving forward [23:06:49] Jhs: Told you before :P [23:07:10] Jhs thanks, can you please make sure to import something that I have edited as import upload with assigning edits to local users? [23:08:00] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T238105 (duration: 00m 51s) [23:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:27] DannyS712, yes, we always check the option to assign locally [23:08:31] Amir1, i forgot already ;P [23:09:06] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T238105 (duration: 00m 51s) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:20] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: T238105 (duration: 00m 52s) [23:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:25] !log ladsgroup@deploy1001 Synchronized langlist: T238105 (duration: 00m 50s) [23:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:30] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [23:12:12] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552153 [23:12:14] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552153 
(owner: 10Ladsgroup) [23:12:59] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552153 (owner: 10Ladsgroup) [23:14:07] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 24s) [23:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:28] !log finished creating five wikis, total duration 134 minutes [23:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:34] RoanKattouw: the floor is yours [23:14:46] sorry for this [23:14:52] (03CR) 10Dzahn: [C: 03+2] xhgui::app: add support for buster/PHP7.3 [puppet] - 10https://gerrit.wikimedia.org/r/552126 (https://phabricator.wikimedia.org/T238788) (owner: 10Dzahn) [23:15:35] (03CR) 10Dzahn: "ack, thanks for pointing that out" [puppet] - 10https://gerrit.wikimedia.org/r/552126 (https://phabricator.wikimedia.org/T238788) (owner: 10Dzahn) [23:17:12] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10Ladsgroup) The wiki has been created [23:17:22] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10Ladsgroup) The Wiki has been created. 
[23:17:49] (PS11) EBernhardson: airflow: Run webserver and scheduler processes [puppet] - https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180)
[23:18:19] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)
[23:18:58] (CR) EBernhardson: "Latest PS makes two changes:" [puppet] - https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[23:19:01] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Aklapper)
[23:19:26] (PS1) CRusnov: netbox report alerting: Simplify icinga check and cleanup [puppet] - https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803)
[23:20:14] Operations, SRE-tools, netbox, Patch-For-Review: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (crusnov) The above patch should address these issues. It hugely simplifies the nagios check script and also uses the API more efficiently so it shouldn't flap anymore...
[23:20:43] I would love to run the script to add wikidata but I'm too tired and need to catch a train
[23:22:37] (CR) jerkins-bot: [V: -1] netbox report alerting: Simplify icinga check and cleanup [puppet] - https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: CRusnov)
[23:23:04] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (MarcoAurelio)
[23:23:20] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Aklapper) If this gets done, potential steps afterwards could be * declining the Phab tasks https://phabr...
[23:24:46] @Jhs I'm getting off - can you post to T235885 once you've done the imports?
[23:24:47] T235885: DannyS712 wasn't attached to SUL at banwiki - https://phabricator.wikimedia.org/T235885
[23:25:06] I stay online with my phone, in case things go south
[23:25:26] DannyS712, sure. but i won't finish today though, there are a looot of pages in wiktionaries
[23:25:49] I also haven't edited the wiktionaries - I just need something with a revision that is assigned to me
[23:26:02] Thanks, --~~~~
[23:26:39] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) The Phab tasks contain some lessons learned. I agree they should be declined, but those le...
[23:26:54] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (MarcoAurelio) @Aklapper re. point 3: @CCicalese_WMF above mentions that she does not want those extension...
[23:27:21] (PS2) CRusnov: netbox report alerting: Simplify icinga check and cleanup [puppet] - https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803)
[23:28:41] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF)
[23:29:44] (CR) jerkins-bot: [V: -1] netbox report alerting: Simplify icinga check and cleanup [puppet] - https://gerrit.wikimedia.org/r/552154 (https://phabricator.wikimedia.org/T237803) (owner: CRusnov)
[23:30:49] Operations, Mail, Wikimedia-Mailing-lists: [Mailing lists] Received 205 bounce action notification emails from mailman in 20 minutes - https://phabricator.wikimedia.org/T238780 (Aklapper) @mforns: Please provide more information, especially if there is any pattern with regard to email addresses. (all...
[23:37:36] (CR) Dzahn: [C: -1] "block parameter 'user' expects a String value, got Tuple" [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) (owner: Dzahn)
[23:46:21] (PS2) Dzahn: phabricator: write my.cnf for db access into each admin home dir [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425)
[23:52:36] (PS1) Catrope: GrowthExperiments: Enable suggested edits without opt-in [mediawiki-config] - https://gerrit.wikimedia.org/r/552156
[23:54:27] (PS3) Dzahn: phabricator: write my.cnf for db access into each admin home dir [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425)
[23:54:38] (PS2) Catrope: GrowthExperiments: Enable suggested edits without opt-in [mediawiki-config] - https://gerrit.wikimedia.org/r/552156 (https://phabricator.wikimedia.org/T227728)
[23:57:06] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/19523/phab1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) (owner: Dzahn)