[00:21:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:22:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:34:09] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:21] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:09:05] RoanKattouw: James_F: OK if I deploy https://gerrit.wikimedia.org/r/535739 and other one?
[01:09:24] https://phabricator.wikimedia.org/T232521#5481615 *
[01:09:30] I don't object to that, I just don't know how to test and I'm about to leave
[01:09:53] OK. I'm just noticing clicking images on mw.org leads to a black screen
[01:09:59] I'm assuming this will fix that, and will test for that.
[01:10:01] no worries
[01:14:38] !log cp[13]xxx: temporarily setting advmss on eqiad and esams caches to 1436 for GRE MTU compat - T232491
[01:16:42] Operations, Traffic, Wikimedia-General-or-Unknown, netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (BBlack) I've made a temporary MTU-related fixup on the affected eqiad and esams cache hosts. Assuming we understand the...
[01:23:09] * Krinkle staging on mwdebug1002
[01:26:17] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/MultimediaViewer/resources/mmv: T232521 - 5ffac2be882 (duration: 01m 07s)
[01:29:31] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/3D/modules/: T232521 - 9732c6e81e (duration: 01m 03s)
[01:38:19] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:13] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:41] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test pag
[01:47:41] expected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:49:15] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:59:41] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:10:15] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:19:15] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is
CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[02:38:27] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:06] interesting
[02:46:21] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:43] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[02:47:47] (PS1) 4nn1l2: Remove OTRS-member usergroup from fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/535749
[02:52:33] PROBLEM - SSH labweb1001.mgmt on labweb1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:48] (PS2) 4nn1l2: Remove OTRS-member usergroup from fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554)
[03:02:54] (PS1) CRusnov: profile::mariadb::ferm_misc: Add netbox host access [puppet] - https://gerrit.wikimedia.org/r/535750
[03:15:55] (PS2) CRusnov: netbox: Various remaining fixes.
[puppet] - https://gerrit.wikimedia.org/r/535750
[03:36:21] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:38:35] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:31] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:31] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:53:01] RECOVERY - SSH labweb1001.mgmt on labweb1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:28:45] Operations, Traffic, Wikimedia-General-or-Unknown, netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (John_of_Reading) Thank you! I've successfully previewed and edited in Firefox. I've also saved an edit in AWB, which had...
[04:37:19] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:15] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:03] Operations, DBA: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Marostegui)
[05:13:40] Operations, DBA: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Marostegui) p:Triage→Normal This host is no longer m1 master {T231403}, but let's wait a few days before decommissioning it
[05:13:52] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:14:25] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:17:14] (CR) Giuseppe Lavagetto: [C: +2] dbctl: use explicit keyword arguments for the callback [software/conftool] - https://gerrit.wikimedia.org/r/534818 (owner: CDanis)
[05:19:28] (PS1) Marostegui: mariadb: Promote db2112 to s1 codfw master [puppet] - https://gerrit.wikimedia.org/r/535764 (https://phabricator.wikimedia.org/T230106)
[05:20:29] (Merged) jenkins-bot: dbctl: use explicit keyword arguments for the callback [software/conftool] - https://gerrit.wikimedia.org/r/534818 (owner: CDanis)
[05:29:01] !log Switchover s1 codfw master db2048 -> db2112 T230106
[05:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:04] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[05:32:45] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:34:27] (CR) Marostegui: [C: +2] mariadb: Promote db2112 to s1 codfw master [puppet] - https://gerrit.wikimedia.org/r/535764 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[05:37:33] RECOVERY - Check systemd state on netmon2001 is OK: OK -
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:17] (PS1) Marostegui: db1063: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/535767 (https://phabricator.wikimedia.org/T232564)
[05:41:00] (CR) Marostegui: [C: +2] db1063: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/535767 (https://phabricator.wikimedia.org/T232564) (owner: Marostegui)
[05:41:37] Operations, DBA, Patch-For-Review: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (Marostegui)
[05:43:19] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:27] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:53] RECOVERY - traffic_server tls process restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[05:47:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 codfw master T230106', diff saved to https://phabricator.wikimedia.org/P9079 and previous config saved to /var/cache/conftool/dbconfig/20190911-054753-marostegui.json
[05:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:57] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[05:48:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2048, will be decommissioned T230106', diff saved to https://phabricator.wikimedia.org/P9080 and previous config saved to
/var/cache/conftool/dbconfig/20190911-054855-marostegui.json
[05:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:07] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:07:03] (CR) Giuseppe Lavagetto: [C: +1] dbctl: add set-candidate-master subcommand on instance (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: CDanis)
[06:10:22] (CR) Volans: "Post-merge -1, see inline. Also if the cumin masters for cloud are now within the cloud network and no bastion/prpxy is used, you probably" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: Andrew Bogott)
[06:16:28] (CR) Giuseppe Lavagetto: [C: +1] dbctl: add set-note instance subcommand [software/conftool] - https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) (owner: CDanis)
[06:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Re-organize s1 codfw weights and roles - T230106', diff saved to https://phabricator.wikimedia.org/P9081 and previous config saved to /var/cache/conftool/dbconfig/20190911-061659-marostegui.json
[06:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:04] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[06:17:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Re-organize s1 codfw weights and roles - T230106', diff saved to
https://phabricator.wikimedia.org/P9082 and previous config saved to /var/cache/conftool/dbconfig/20190911-061924-marostegui.json
[06:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:39] (CR) Volans: [C: +2] Inject device hostname [software/homer] - https://gerrit.wikimedia.org/r/535720 (owner: Ayounsi)
[06:25:15] (Merged) jenkins-bot: Inject device hostname [software/homer] - https://gerrit.wikimedia.org/r/535720 (owner: Ayounsi)
[06:26:16] (CR) jenkins-bot: Inject device hostname [software/homer] - https://gerrit.wikimedia.org/r/535720 (owner: Ayounsi)
[06:27:27] (PS2) Elukey: aptrepo: change the amd-rocm27 component to amd-rocm271 [puppet] - https://gerrit.wikimedia.org/r/535646
[06:27:29] (CR) Volans: [C: +2] "Sorry, picked.+2 instead of reply. Anyway LGTM, I would.have kept fqdn as name but as discussed on IRC all the templates already use hostn" [software/homer] - https://gerrit.wikimedia.org/r/535720 (owner: Ayounsi)
[06:31:27] (CR) Volans: [C: +1] "LGTM. Optionally make sure we test it both with and without config to ensure we keep the behaviour going forward." [software/homer] - https://gerrit.wikimedia.org/r/535722 (owner: Ayounsi)
[06:37:21] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:03] (CR) Elukey: "Thanks a lot for the review!"
(7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[06:43:48] (CR) Volans: "Correction inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: Andrew Bogott)
[06:44:47] (PS3) Noa wmde: TR: Configure a feature flag for Wikibase Tainted References [mediawiki-config] - https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191)
[06:45:10] (CR) Elukey: "Sorry, meant to say in my last comment:" [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[06:45:18] !log Drop unused database puppet on m1 - T231539
[06:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:23] T231539: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539
[06:45:24] Operations, SDC General, Structured Data Engineering, Structured-Data-Backlog, and 2 others: Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (Mathew.onipe)
[06:46:35] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:45] Operations, DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (Marostegui) Open→Resolved This is all done
[06:50:15] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[06:50:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[06:50:40] we are going to restart it, please don't do it now
[06:50:49] (for any opsen reading)
[06:54:53] PROBLEM - SSH labweb1001.mgmt on labweb1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:21] (PS1) Effie Mouzeli: 50% of anonymous users via PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/535770 (https://phabricator.wikimedia.org/T219150)
[06:58:16] !log Restarting Gerrit
[06:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:24] I wanted to take some traces
[07:00:43] !log Restarting Gerrit - T224448
[07:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:51] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448
[07:01:09] PROBLEM - Check the last execution of git_pull_charts on contint2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:01:23] ^ expected
[07:01:27] PROBLEM - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:02:09] (PS4) Elukey: Add sre.hadoop.reboot-workers.py [cookbooks] -
https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297)
[07:02:47] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.047 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:02:59] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26353 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 to reboot for kernel upgrade T230785', diff saved to https://phabricator.wikimedia.org/P9083 and previous config saved to /var/cache/conftool/dbconfig/20190911-070635-marostegui.json
[07:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:39] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785
[07:07:18] !log Stop MySQL on db1122 to reboot for a kernel upgrade T230785
[07:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:41] RECOVERY - Check the last execution of git_pull_charts on contint2001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:11:59] RECOVERY - Check the last execution of git_pull_charts on deploy2001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:12:42] (PS1) Effie Mouzeli: 50% of anonymous users via PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/535786 (https://phabricator.wikimedia.org/T219150)
[07:14:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1122', diff saved to https://phabricator.wikimedia.org/P9084 and previous config saved to /var/cache/conftool/dbconfig/20190911-071450-marostegui.json
[07:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:17:16] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe)
[07:23:19] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 54.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1122', diff saved to https://phabricator.wikimedia.org/P9085 and previous config saved to /var/cache/conftool/dbconfig/20190911-072344-marostegui.json
[07:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:04] (CR) Muehlenhoff: "What about labtestpuppetmaster2001? That'll probably also go away entirely?"
[puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: Andrew Bogott)
[07:26:49] (PS1) Muehlenhoff: Switch labpuppetmaster* to facter 3 / puppet 5 [puppet] - https://gerrit.wikimedia.org/r/535789
[07:27:31] (CR) Elukey: [C: -1] "Precautionary -1 to verify my questions before merging :)" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: Nuria)
[07:29:35] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 70 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1122', diff saved to https://phabricator.wikimedia.org/P9086 and previous config saved to /var/cache/conftool/dbconfig/20190911-073335-marostegui.json
[07:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:03] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:29] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1122', diff saved to https://phabricator.wikimedia.org/P9087 and previous config saved to /var/cache/conftool/dbconfig/20190911-075139-marostegui.json
[07:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:50] !log reimaging restbase-dev1005 to Stretch T224554
[07:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:53] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[07:54:48] (CR) Elukey: [C: +2] aptrepo: change the amd-rocm27 component to amd-rocm271 [puppet] - https://gerrit.wikimedia.org/r/535646 (owner: Elukey)
[07:55:17] RECOVERY - SSH labweb1001.mgmt on labweb1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1122', diff saved to https://phabricator.wikimedia.org/P9088 and previous config saved to /var/cache/conftool/dbconfig/20190911-080450-marostegui.json
[08:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:16] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[08:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] !log execute reprepro clearvanished on install1002 to clear buster-wikimedia|thirdparty/amd-rocm27 (not used anymore)
[08:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:32] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:41] !log add thirdparty/amd-rocm271 to buster-wikimedia and update it with ROCm 2.7.1 packages
[08:13:42] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:58] (PS1) Elukey: profile::statistics::gpu: update ROCm version to 2.7.1 [puppet] - https://gerrit.wikimedia.org/r/535791
[08:17:39] (CR) Elukey: [C: +2] profile::statistics::gpu: update ROCm version to 2.7.1 [puppet] - https://gerrit.wikimedia.org/r/535791 (owner: Elukey)
[08:19:28] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@56a8342]: Stop pregenerating enwiktionary page/definition - T231361
[08:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:34] T231361: Stop pregenerating and storing /page/definition responses - https://phabricator.wikimedia.org/T231361
[08:21:27] (PS1) Elukey: amd_rocm: add support for 2.7.1 [puppet] - https://gerrit.wikimedia.org/r/535792
[08:22:13] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@56a8342]: Stop pregenerating enwiktionary page/definition - T231361 (duration: 02m 45s)
[08:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:55] (CR) Elukey: [C: +2] amd_rocm: add support for 2.7.1 [puppet] - https://gerrit.wikimedia.org/r/535792 (owner: Elukey)
[08:23:51] (CR) Filippo Giunchedi: [C: +1] prometheus: add alert for widespread systemd failed units [puppet] - https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: Herron)
[08:24:25] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@069d297]: Revert Stop pregenerating enwiktionary page/definition
[08:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:59] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@069d297]: Revert Stop pregenerating enwiktionary page/definition (duration: 00m 34s)
[08:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:24] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate
Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (MoritzMuehlenhoff) restbase1005-dev is now running Stretch and good to bootstrap.
[08:30:11] (PS2) Giuseppe Lavagetto: envoyproxy: fix settings for jessie [puppet] - https://gerrit.wikimedia.org/r/534401
[08:32:43] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:32:51] (CR) Gehel: [C: -1] "I'm not entirely sure of the approach. See comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: Mathew.onipe)
[08:34:45] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@7a8ab89]: Stop pregenerating enwiktionary page/definition, take #2 - T231361
[08:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:48] T231361: Stop pregenerating and storing /page/definition responses - https://phabricator.wikimedia.org/T231361
[08:35:39] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 87.49 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:36:17] (PS3) Giuseppe Lavagetto: envoyproxy: fix settings for jessie [puppet] - https://gerrit.wikimedia.org/r/534401
[08:36:26] (PS2) Muehlenhoff: On Ganeti servers print the current master node in MOTD [puppet] - https://gerrit.wikimedia.org/r/535150
[08:36:59] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@7a8ab89]: Stop pregenerating enwiktionary page/definition, take #2 - T231361 (duration: 02m 13s)
[08:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:29] RECOVERY - Check
systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:57] (PS1) Elukey: aptrepo: update whitelist for thirdparty/rocm271 [puppet] - https://gerrit.wikimedia.org/r/535795
[08:38:51] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Upgraded stat1005 with ROCm 2.7.1, from my tests everything lo...
[08:38:56] (CR) Giuseppe Lavagetto: [C: +2] envoyproxy: fix settings for jessie [puppet] - https://gerrit.wikimedia.org/r/534401 (owner: Giuseppe Lavagetto)
[08:39:23] (PS2) Elukey: aptrepo: update whitelist for thirdparty/rocm271 [puppet] - https://gerrit.wikimedia.org/r/535795
[08:40:38] (CR) Elukey: [C: +2] aptrepo: update whitelist for thirdparty/rocm271 [puppet] - https://gerrit.wikimedia.org/r/535795 (owner: Elukey)
[08:41:02] (PS3) Muehlenhoff: On Ganeti servers print the current master node in MOTD [puppet] - https://gerrit.wikimedia.org/r/535150
[08:43:57] Operations, Traffic, observability: varnish request rates showed a spike up while nginx request rates didn't - https://phabricator.wikimedia.org/T232574 (fgiunchedi)
[08:44:53] (CR) Muehlenhoff: [C: +2] On Ganeti servers print the current master node in MOTD [puppet] - https://gerrit.wikimedia.org/r/535150 (owner: Muehlenhoff)
[08:45:15] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:30] (PS1) Muehlenhoff: Fix motd script [puppet] - https://gerrit.wikimedia.org/r/535801
[09:07:40] (CR) jerkins-bot: [V: -1] Fix motd script [puppet] - https://gerrit.wikimedia.org/r/535801 (owner: Muehlenhoff)
[09:07:57] Operations, Traffic, observability: Alert in case of significant discrepancies between the number of nginx and varnish responses - https://phabricator.wikimedia.org/T232574 (ema) p:Triage→Normal
[09:08:32] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints - T231361 T232449
[09:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:37] T231361: Stop pregenerating and storing /page/definition responses - https://phabricator.wikimedia.org/T231361
[09:08:37] T232449: Expose new PCS javascript endpoints: pagelib_body_start and pagelib_body_end - https://phabricator.wikimedia.org/T232449
[09:11:41] (PS2) Muehlenhoff: Fix motd script [puppet] - https://gerrit.wikimedia.org/r/535801
[09:11:55] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints - T231361 T232449 (duration: 03m 24s)
[09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:00] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #2
[09:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:34] (CR) Muehlenhoff: [C: +2] Fix motd script [puppet] - https://gerrit.wikimedia.org/r/535801 (owner: Muehlenhoff)
[09:14:51] Operations, Traffic, Wikimedia-Logstash, observability, User-Addshore: Changing Kibana filters is ridiculously slow -
https://phabricator.wikimedia.org/T189333 (10ema) >>! In T189333#5481492, @Krinkle wrote: > I re-ran my analysis today, and oddly enough the total number of fields it not only s... [09:16:59] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #2 (duration: 03m 59s) [09:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:24] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #3 [09:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Let's do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535786 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:27:56] (03CR) 10Mobrovac: Add $ensure params with defaults for eventlogging service - no op (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [09:29:38] (03CR) 10Effie Mouzeli: [C: 03+2] 50% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535786 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:29:53] (03Abandoned) 10Giuseppe Lavagetto: eventschemas: use safe service restart script [puppet] - 10https://gerrit.wikimedia.org/r/518671 (owner: 10Giuseppe Lavagetto) [09:30:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 56.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:31:40] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 109.4 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:32:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #3 (duration: 13m 18s) [09:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] (03Merged) 10jenkins-bot: 50% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535786 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:32:55] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #3a [09:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:11] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:35:08] sorry this is me --^ [09:35:16] it is a host not yet serving traffic [09:35:22] I am testing console redirection [09:36:11] (03PS1) 10Muehlenhoff: Really fix motd [puppet] - 10https://gerrit.wikimedia.org/r/535809 [09:36:13] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 16%, RTA = 0.29 ms [09:37:13] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:37:19] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:38:43] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:39:42] (03CR) 10Muehlenhoff: [C: 03+2] Really fix motd [puppet] - 10https://gerrit.wikimedia.org/r/535809 (owner: 10Muehlenhoff) [09:43:05] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:43:42] I think there is a task for that alert [09:43:44] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) Making a summary after a bit of time of the current status: * While setting up the hosts, I noticed that console redirection didn't work, and appli... [09:44:35] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:45:11] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@cf2ca76]: Stop using storage for enwiktionary definition and expose new PCS javascript endpoints, take #3a (duration: 12m 15s) [09:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:13] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:23] yup there is, effie [09:46:41] you created it :) [09:46:46] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Push PHP7 traffic to 50% - T219150 (duration: 01m 03s) [09:46:47] ah right [09:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:49] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [09:49:44] (03PS1) 10Jcrespo: Update stretch mariadb wmf package to 10.1.41 [software] - 10https://gerrit.wikimedia.org/r/535810 [09:55:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:56:34] (03CR) 10jenkins-bot: 50% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535786 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:56:44] !log upgrading mariadb client library on mariadb root clients [09:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] (03PS2) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 [10:01:12] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 (owner: 10Giuseppe Lavagetto) [10:01:56] !log stopping and upgrading db1074 [10:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:45] (03PS1) 10Ema: planet: add *.planet.wikimedia.org to SubjAltName [puppet] - 10https://gerrit.wikimedia.org/r/535813 (https://phabricator.wikimedia.org/T210411) [10:11:59] (03PS4) 10Noa wmde: TR: set WikibaseTaintedReferencesEnabled true on labs wikidatawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) [10:13:40] jynus: once the new bbu for db1074 is bought we should upgrade sanitarium, labs and move replication back under db1074 [10:14:04] yeah, when maintenance is done [10:14:17] I just wanted to upgrade so it is ready to go down [10:14:34] yeah, definitely! [10:14:35] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [10:17:00] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [10:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `tureis.codfw.wmnet` - tureis.codfw.wmnet - Removed from Puppet mast... [10:18:25] (03CR) 10Ema: [C: 03+2] planet: add *.planet.wikimedia.org to SubjAltName [puppet] - 10https://gerrit.wikimedia.org/r/535813 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:18:41] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `roentgenium.eqiad.wmnet` - roentgenium.eqiad.wmnet - Removed from P... 
[10:23:02] !log removed roentgenium/tureis in Ganeti T224559 [10:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:09] T224559: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 [10:23:27] (03PS2) 10Muehlenhoff: Remove DNS entries for roentgenium/tureis [dns] - 10https://gerrit.wikimedia.org/r/534019 (https://phabricator.wikimedia.org/T224559) [10:29:28] effie: 💃 [10:29:41] haha [10:30:09] hey Amir1, edit issue confirmed unrelated, but I may have discovered the cause [10:31:12] we may have a large bot editing Amir1: https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&from=1567593056471&to=1568197856471 [10:31:53] I can block or give a warning if it's causing trouble [10:31:58] actually [10:32:02] it is good news [10:32:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for roentgenium/tureis [dns] - 10https://gerrit.wikimedia.org/r/534019 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [10:32:36] because if there is a higher amount of contention because there is a higher amount of edits [10:32:40] https://www.wikidata.org/wiki/Special:Contributions/LargeDatasetBot [10:32:46] It seems the user [10:32:56] it is not a huge issue, e.g. if it is only affecting itself [10:33:33] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10MoritzMuehlenhoff) 05Open→03Resolved New instances (failoid1001 and failoid2001) have been set up with Buster and are in use. The old instances (roentgenium... 
[10:33:35] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:33:47] just wanted to give a heads up, my only worry was normal users getting affected [10:33:54] but it is probably not the vase [10:33:56] *case [10:34:16] (03PS1) 10Ema: peopleweb: add people.wikimedia.org to SubjAltName [puppet] - 10https://gerrit.wikimedia.org/r/535814 (https://phabricator.wikimedia.org/T210411) [10:34:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:34:54] People should not create 5 items per second, that disrupts others' work [10:35:14] I agree, we should probably advise them to reduce the speed [10:35:21] even if it is not causing infra issues [10:35:30] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:35:31] because it takes away resources from other users [10:35:40] and we have done the same in the past [10:36:19] so no block because there are no ongoing issues [10:36:27] but a comment would be ok [10:36:29] (03PS3) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 [10:37:03] jynus, Amir1: FYI we now also have the option to block IPs at the CDN level https://wikitech.wikimedia.org/wiki/Varnish#Blacklist_an_IP in case that becomes necessary [10:37:09] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:39] Amir1: slightly unrelated, could you check with wikibase team if they checked what we commented on running out of autoinc ids? [10:37:46] *slightly related [10:37:57] maybe it won't be a problem with the new model [10:38:01] but I didn't check [10:38:31] I believe the main issue is probably rc_id [10:39:08] Sure, I haven't seen the phab ticket. 
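A rough sketch of the headroom arithmetic behind this rc_id exchange, using the two figures quoted in it (current id ~1011555477, max_value 2147483647). The function names and the `ids_per_second` rate are hypothetical and only for illustration; the unsigned limit assumes a 32-bit INT column, which the quoted max_value implies but the log does not state outright.

```python
# Figures quoted in the log; everything else here is an illustrative assumption.
CURRENT_ID = 1_011_555_477       # "I think we are on edit ~1011555477"
SIGNED_INT_MAX = 2_147_483_647   # "max_value: 2147483647", i.e. 2**31 - 1
UNSIGNED_INT_MAX = 2**32 - 1     # limit if the same 32-bit column were unsigned

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def headroom(current_id: int, limit: int) -> int:
    """Number of ids still available before the column overflows."""
    return limit - current_id


def years_left(current_id: int, limit: int, ids_per_second: float) -> float:
    """Years until exhaustion at a constant (hypothetical) allocation rate."""
    if ids_per_second <= 0:
        raise ValueError("ids_per_second must be positive")
    return headroom(current_id, limit) / ids_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for name, limit in (("signed", SIGNED_INT_MAX), ("unsigned", UNSIGNED_INT_MAX)):
        print(name, headroom(CURRENT_ID, limit), "ids left")
```

Making the key unsigned adds exactly 2**31 ids on top of whatever signed headroom remains, so the real time gained depends entirely on the sustained id rate, which is not given in the log.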
[10:39:10] (03CR) 10Ema: [C: 03+2] peopleweb: add people.wikimedia.org to SubjAltName [puppet] - 10https://gerrit.wikimedia.org/r/535814 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:39:37] I think we are on edit ~1011555477 [10:39:59] so not an imminent worry, but something to take into account [10:40:21] what's the limit? [10:40:36] our tech-lead didn't know about it [10:40:53] let me check, because there are also changes on the comment refactor [10:42:12] max_value: 2147483647 [10:43:20] just to be clear, wikibase shouldn't be responsible for changing rcs (for example) but they should join me on promoting support for a large number of edits [10:43:43] I see [10:44:06] I think it'll take a couple years until we get there [10:44:22] jynus: Sent the user a message [10:44:38] well, less time with bots like that one :-D [10:44:44] thanks, Amir1! [10:45:03] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:14] lol, true [10:45:32] also altering large tables could take a few months [10:46:16] https://phabricator.wikimedia.org/P8198 I have this but I don't think I have a ticket [10:46:19] I will create one [10:46:39] @jouncebot next [10:46:46] meh don't remember how to ping the bot [10:46:47] there is https://phabricator.wikimedia.org/T62962 [10:46:56] and https://phabricator.wikimedia.org/T63111 [10:47:00] so there are tickets already [10:47:12] (03PS1) 10Ladsgroup: Set item terms on write both up to Q10mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535815 (https://phabricator.wikimedia.org/T225055) [10:49:02] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10akosiaris) [10:50:00] (03CR) 10Ladsgroup: [C: 04-1] TR: set WikibaseTaintedReferencesEnabled true on labs wikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [10:51:11] T62962 will buy us a couple of years [10:51:11] T62962: The primary key of recentchanges (rc_id) table should be unsigned - https://phabricator.wikimedia.org/T62962 [10:51:21] yeah [10:51:28] I think I would do that for now [10:51:38] the others are only 25% filled after 5 years [10:52:35] and unsigned would avoid compatibility problems [10:55:40] once the bot has reduced its rate I would be ok with retrying the conversion [10:55:47] the config [10:57:24] !log drop the wiktionary definition keyspace - T231361 [10:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:27] T231361: Stop pregenerating and storing /page/definition responses - https://phabricator.wikimedia.org/T231361 [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1100). [11:00:04] noa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] ACKNOWLEDGEMENT - HP RAID on db1074 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T232592 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:00:15] (03CR) 10Jbond: "the labspuppetmasters have `puppetdb-termini` installed however the 5.5 version has not been packaged for jessie. However i don't think t" [puppet] - 10https://gerrit.wikimedia.org/r/535789 (owner: 10Muehlenhoff) [11:00:17] 10Operations, 10ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T232592 (10ops-monitoring-bot) [11:00:26] o/ [11:00:44] I'm here for my two patches [11:00:45] o/ [11:00:47] really the same one [11:00:56] hmm... I'm not pinged, did I forget add it to deployments? [11:01:02] o/ [11:01:21] raynor: I don’t see you in the deployment calendar [11:01:31] * raynor facepalm, yeah... let me add one task, sorry [11:01:34] but Amir1 and apergos weren’t pinged either, so I guess jouncebot wasn’t reloaded in time? [11:01:37] jouncebot: reload [11:01:41] (I think that’s the command?) [11:01:48] refresh? [11:01:49] one sec, adding one task [11:01:58] though it’s too late for the pings now anyways [11:02:03] jouncebot: refresh [11:02:04] I refreshed my knowledge about deployments. [11:02:09] thanks Amir1 [11:03:38] (03CR) 10Ladsgroup: [C: 03+1] "isset is there. Sorry for this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [11:04:05] (03CR) 10Muehlenhoff: "puppetdb-termini has been installed by puppetmaster::puppetdb::client, but that class has been dropped in https://gerrit.wikimedia.org/r/#" [puppet] - 10https://gerrit.wikimedia.org/r/535789 (owner: 10Muehlenhoff) [11:04:07] jouncebot: refresh [11:04:07] I refreshed my knowledge about deployments. [11:04:14] added mine, sorry for delay [11:05:01] jouncebot: now [11:05:01] For the next 0 hour(s) and 54 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1100) [11:05:16] anyway, shall we begin? [11:05:51] I can some of them [11:05:56] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [11:06:49] (03Merged) 10jenkins-bot: TR: set WikibaseTaintedReferencesEnabled true on labs wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [11:07:07] (03CR) 10jenkins-bot: TR: set WikibaseTaintedReferencesEnabled true on labs wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [11:07:45] thanks Amir1 :) [11:07:59] (obviously there's nothing to see or check) [11:08:01] one of mine was merged and one was not for whatever reason. 
thanks for the merge [11:09:39] apergos: the backport takes quite some time, I +2'd one already [11:09:51] the other one is already merged [11:09:55] it could go around now [11:10:08] tarrow: noa_wmde : It will go live automagically in a couple of minutes, we can't do anything about it in SWAT [11:10:17] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:535567|TR: set WikibaseTaintedReferencesEnabled true on labs wikidatawiki (T232191)]] (duration: 01m 03s) [11:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:23] T232191: Create a global config switch to use as a feature flag for tainted references - https://phabricator.wikimedia.org/T232191 [11:10:45] No hurry (right now it's not actually turning anything on; just trying to get ahead for later today :) ) [11:11:02] cool [11:11:39] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535815 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:12:49] (03Merged) 10jenkins-bot: Set item terms on write both up to Q10mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535815 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:13:04] (03CR) 10jenkins-bot: Set item terms on write both up to Q10mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535815 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:15:24] jynus: marostegui This is going live ^ That would enable new term store but for smaller portion of wikidata so we don't end up with deadlock when wbt_text is being built [11:15:28] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:535815|Set item terms on write both up to Q10mio (T225055)]] (duration: 01m 03s) [11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:30] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - 
https://phabricator.wikimedia.org/T225055 [11:15:55] thanks, Amir1 [11:16:15] apergos: this is master: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/535200 [11:16:44] ah woops [11:16:50] there is a backport to 22 I think [11:17:14] apergos: I don't think it's needed, it seems it got merged before the branch cut [11:17:20] oh, whew [11:17:29] good enough then, that's why I see 'included in 22' [11:17:39] https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.22/Changelog [11:17:46] git #3a39f364 - maintenance/getReplicaServer.php: Remove reference to long-deleted config var (task T232268) by Brad Jorsch [11:17:47] T232268: All dumps are broken by MW change which breaks getReplicaServer.php - https://phabricator.wikimedia.org/T232268 [11:17:52] It's there already [11:18:03] so only wmf.21 [11:18:09] good good [11:18:22] only group 0 is on 22 right now, right? [11:18:24] * apergos checks [11:18:44] yeah 0 so it's ok [11:18:56] raynor: you're next until the backport gets merged [11:19:05] (03PS2) 10Ladsgroup: Disable AMC Outreach modal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535615 (https://phabricator.wikimedia.org/T231436) (owner: 10Pmiazga) [11:19:08] let me edit deployments to remove the one patch then [11:19:17] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535615 (https://phabricator.wikimedia.org/T231436) (owner: 10Pmiazga) [11:19:20] Amir: thx, I'm waiting [11:20:22] (03Merged) 10jenkins-bot: Disable AMC Outreach modal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535615 (https://phabricator.wikimedia.org/T231436) (owner: 10Pmiazga) [11:20:39] (03CR) 10jenkins-bot: Disable AMC Outreach modal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535615 (https://phabricator.wikimedia.org/T231436) (owner: 10Pmiazga) [11:20:59] raynor: live at mwdebug1002 [11:22:00] Amir1: checking [11:25:53] I rebased this time :D [11:26:37] Amir1, works [11:26:55] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] 
- 10https://gerrit.wikimedia.org/r/535789 (owner: 10Muehlenhoff) [11:27:32] raynor: okay, going live [11:29:01] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:535615|Disable AMC Outreach modal (T231436)]] (duration: 01m 04s) [11:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:08] T231436: Turn off AMC outreach modal - https://phabricator.wikimedia.org/T231436 [11:30:17] I go grab a coffee until the backport gets merged [11:30:40] mine is [11:30:46] but I can wait for your coffee [11:33:09] (03PS1) 10Jcrespo: monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 [11:33:48] (03CR) 10jerkins-bot: [V: 04-1] monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [11:35:38] (03PS2) 10Jcrespo: monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 [11:37:22] raynor: Yours is live btw [11:37:46] Amir1, thx, everything works [11:38:04] yup, I noticed sync log [11:38:19] apergos: It's live in mwdebug1002 if it's testable [11:38:29] tested there, works [11:38:31] thank you [11:38:37] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:11] (03CR) 10Jcrespo: "I know this is technically wrong (the directory could be created and the enabling fails), but the possibility of failure is so remote that" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [11:40:56] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.21/maintenance/getReplicaServer.php: SWAT: [[gerrit:535217|maintenance/getReplicaServer.php: Remove reference to long-deleted config var (T232268)]] (duration: 01m 04s) [11:40:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:59] T232268: All dumps are broken by MW change which breaks getReplicaServer.php - https://phabricator.wikimedia.org/T232268 [11:41:46] !log EU SWAT is done [11:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:38] \o/ [11:44:57] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:16] gerrit seems down to me [11:47:41] same here, can ping but not connect [11:48:25] (03PS3) 10Jcrespo: monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 [11:48:54] tls handshake never completed. huh [11:49:02] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:49:03] hmm [11:49:33] beside someone fetching all repositories [11:49:38] I dont see much more details [11:49:38] bah [11:49:59] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:50:11] ssh still working so far [11:50:23] yeah that is just http threads being blocked I believe [11:50:32] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) 05Open→03Resolved Archives no longer public. 
[11:56:23] same as this morning :-\ [11:57:31] !log applying GRE MTU -> MSS fixup to cobalt and gerrit2001 - T218184 [11:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:35] T218184: Update apertium-nno-nob, apertium-swe-dan, apertium-swe-nor and apertium-dan-nor packages - https://phabricator.wikimedia.org/T218184 [11:57:48] 10Operations, 10ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T232592 (10Marostegui) 05Open→03Invalid This is being handled at T231638 [11:58:01] hashar: yeah I see multiple timeouts on https updating repos [11:58:10] try again with fresh connections! [11:58:19] (gerrit) [11:59:41] !log Restarting Gerrit due to deadlock in the account cache # T224448 [11:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:45] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [11:59:48] or that too [11:59:59] yeah :-\\\ [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1200) [12:00:18] there is some lock that is not being released [12:00:43] so threads serving http traffic eventually pile up until the pool is exhausted [12:00:58] (it is in Gerrit, not apache, not traffic caches) [12:01:27] hashar: I think the MTU fix I made was also causing some of this, and it could be that the MTU problem was causing lots of stalled/hung connections at the TCP level, which would in turn naturally increase some lock contention and cause your effects too [12:01:28] Nikerabbit Amir1 apergos Lucas_WMDE : gerrit http should work again now (I have restarted it) [12:01:41] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.134 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [12:02:03] Thanks [12:02:10] yep loads for me [12:02:37] RECOVERY - Gerrit JSON on 
gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26003 bytes in 0.589 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [12:02:47] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:02:49] thanks hashar [12:09:55] 10Operations, 10Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) p:05Triage→03Normal [12:10:12] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10jbond) p:05Triage→03Normal [12:10:38] 10Operations, 10Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) [12:13:23] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:15:57] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10jbond) p:05Normal→03High a:03Cmjohnson [12:17:21] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Netbox [12:18:55] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 349 bytes in 7.395 second response time https://wikitech.wikimedia.org/wiki/Netbox [12:20:03] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:25:27] (03PS1) 10Urbanecm: Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) [12:26:10] jouncebot: next [12:26:10] 
In 0 hour(s) and 33 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1300) [12:26:11] jouncebot: now [12:26:11] For the next 0 hour(s) and 33 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1200) [12:27:07] (03CR) 10jerkins-bot: [V: 04-1] Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) (owner: 10Urbanecm) [12:28:38] (03CR) 10Marostegui: "Is this what you tested on db1074 already?" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [12:32:15] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:38:37] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:58] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [12:40:03] 10Operations, 10Performance-Team, 10SRE-Access-Requests: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10jbond) @greg are you able to approve this access request [12:40:42] !log removing now puppet/puppetdb packages from labpuppetmaster* T171188 [12:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:45] T171188: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 [12:40:51] !log removing now obsolete puppet/puppetdb packages from labpuppetmaster* T171188 [12:40:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:46] (03PS2) 10Muehlenhoff: Switch labpuppetmaster* to facter 3 / puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/535789 [12:46:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch labpuppetmaster* to facter 3 / puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/535789 (owner: 10Muehlenhoff) [12:46:27] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (10jbond) Hi Greg, Currently the [[ https://wikitech.wikimedia.org/wiki/Mailman#Step-by-step_procedure_using_the_UI | process to create a mailing list]] is all done using the UI i.e... [12:46:31] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:10] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10akosiaris) I wonder how was though git-lfs populated in the first place on the old worker nodes? manually perhaps? Anyway, one clean way out of t... [12:48:43] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [12:48:53] !log upgrade labpuppetmaster* to use facter 3 / puppet 5 [12:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:03] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) John Bond https://phabricator.wikimedia.org/T232591 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:56:42] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, 10User-Addshore: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10fgiunchedi) >>! 
In T189333#5481492, @Krinkle wrote: > I re-ran my analysis today, and oddly enough the total number of fields it not... [13:00:04] hashar: (Dis)respected human, time to deploy MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1300). Please do the needful. [13:02:10] (03PS1) 10Marostegui: mariadb: Promote db1122 as s2 primary master [puppet] - 10https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785) [13:02:48] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [13:03:34] (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [13:03:40] (03PS1) 10Hashar: group1 wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535840 [13:03:42] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535840 (owner: 10Hashar) [13:03:59] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10Aklapper) [13:04:05] (03PS1) 10Jbond: ganeti: add netbox to ganate api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 [13:04:24] (03PS1) 10Marostegui: wmnet: Change s2 CNAME to db1122 [dns] - 10https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785) [13:04:42] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535840 (owner: 10Hashar) [13:04:52] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [13:06:26] (03CR) 10Marostegui: [C: 03+1] "Thanks for answering my questions!" 
[puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [13:06:38] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535840 (owner: 10Hashar) [13:09:10] (03PS2) 10Jbond: ganeti: add netbox to ganate api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 [13:09:14] (03CR) 10Muehlenhoff: "This also needs netbox2001.wikimedia.org I think" [puppet] - 10https://gerrit.wikimedia.org/r/535841 (owner: 10Jbond) [13:09:46] (03CR) 10Marostegui: "Let's try to talk about this next week?" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [13:10:19] syncing [13:11:16] (03CR) 10Muehlenhoff: [C: 04-1] "Ack, it's on my TODO list, I won't merge this for sure :-)" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [13:11:42] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10Kelson) @Aklapper Happy you tackle that issue. I wanted to open that ticket myself for quite a long time already. You can put me in the list of admins. [13:12:08] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.22 [13:12:12] (03PS4) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 [13:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.22 (duration: 01m 02s) [13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:46] 13:12:57 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. 
Error rate: Before: 0.04, After: 2.00, Threshold: 1.00) [13:13:51] the only canary that failed the threshold [13:14:48] * hashar whistles [13:15:32] includes/libs/rdbms/lbfactory/LBFactoryMulti.php: PHP Notice: Undefined index: [13:15:34] bah [13:18:10] (03CR) 10Muehlenhoff: [C: 03+1] ganeti: add netbox to ganate api hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535841 (owner: 10Jbond) [13:18:27] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10StephaneKiwix) Same here. You can add me (and add yourself, the more the merrier). [13:20:17] (03PS3) 10Jbond: Ganeti: add netbox to Ganeti api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 [13:20:57] (03PS5) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 [13:22:22] dcausse: amir1: not sure whom to ping besides you :D WikibaseCirrusSearch emits some cirrussearch-too-busy-error messages https://phabricator.wikimedia.org/T232612 [13:22:35] happened with 1.34.0-wmf.21 as well and started roughly half an hour ago [13:22:48] that's not Wikibase [13:23:09] I mean it's probably dcausse territory [13:23:15] it was a 50-50% chance if one didn't know :-D [13:23:18] hashar: it's someone using the wbsearchentity a bit aggressively I think [13:24:23] (03PS7) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [13:24:56] dcausse: cool. Would you mind adjusting the list of projects on the task and setting a priority of some sort? :] Does not seem to be a train blocker to me anyway [13:25:00] merci! 
[13:25:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18240/ununpentium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/529920 (owner: 10Giuseppe Lavagetto) [13:27:09] some curious search strings… I wonder if a third party is now directly forwarding searches to wbsearchentities? [13:27:27] jynus: the user hasn't answered in https://www.wikidata.org/wiki/User_talk:GZWDer, do you want me to block the bot for now? [13:27:36] not sure why someone would search for the term in XXj1TwpAAEkAADZnu64AAACB on Wikidata, for example [13:28:09] personally, Amir1, I would ask someone on the Village Pump [13:28:24] but I am not an admin there [13:28:28] so up to you [13:28:31] Amir1: perhaps get another admin’s opinion in #wikidata? [13:28:40] sure [13:28:45] (03CR) 10Andrew Bogott: [C: 03+2] "> What about labtestpuppetmaster2001? That'll probably also go away entirely?" [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [13:28:46] but in general I think people have been blocked for that kind of creation rate even without imminent DB problems [13:28:57] IMHO it would be totally acceptable [13:28:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:29:22] bots are a no-brainer, Lucas_WMDE [13:29:36] but if it is not urgent, I think it can wait [13:31:26] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Izno) >>! In T232491#5481717, @BBlack wrote: > Can any previous reporters confirm the same continued breakage, or new su... 
[13:32:26] (03PS1) 10Mathew.onipe: Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) [13:32:30] (03CR) 10Ottomata: Add $ensure params with defaults for eventlogging service - no op (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [13:33:33] (03CR) 10jerkins-bot: [V: 04-1] Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:35:48] (03PS1) 10Giuseppe Lavagetto: envoyproxy: fixup for hot restarted [puppet] - 10https://gerrit.wikimedia.org/r/535846 [13:35:57] PROBLEM - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:58] effie: _joe_ I have one which is apparently php7.2 related :-\ https://phabricator.wikimedia.org/T232613 [13:36:06] (03PS1) 10KartikMistry: apertium-dan: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/535847 (https://phabricator.wikimedia.org/T218184) [13:36:12] (03PS5) 10Muehlenhoff: Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 [13:36:37] <_joe_> hashar: "one" what? [13:36:38] (03PS1) 10Ottomata: Remove ensure parameter in service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/535848 (https://phabricator.wikimedia.org/T232122) [13:36:42] err sorry [13:36:44] one fault / log spam [13:36:50] <_joe_> hashar: also unless it's a system-side thing [13:36:52] seems some array/value is not properly set sometime [13:37:01] <_joe_> I wouldn't take it to us [13:37:08] <_joe_> but can I see the logs? 
[13:38:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoyproxy: fixup for hot restarted [puppet] - 10https://gerrit.wikimedia.org/r/535846 (owner: 10Giuseppe Lavagetto) [13:38:35] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:43] <_joe_> hashar: ? [13:39:15] _joe_: I have written a bunch of details on https://phabricator.wikimedia.org/T232613 [13:39:16] (03CR) 10Ottomata: [C: 03+2] Remove ensure parameter in service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/535848 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [13:39:20] somehow some index is not defined [13:39:25] (03PS2) 10Ottomata: Remove ensure parameter in service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/535848 (https://phabricator.wikimedia.org/T232122) [13:39:25] 297 if ( !$groupLoads[ILoadBalancer::GROUP_GENERIC] ) { [13:39:27] bah [13:39:29] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove ensure parameter in service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/535848 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [13:39:33] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:39:54] hashar: looking [13:40:04] includes/libs/rdbms/loadbalancer/ILoadBalancer.php: const GROUP_GENERIC = ''; [13:40:05] hehe [13:40:14] or it is an issue in the database configuration array [13:40:26] (03PS2) 10Mathew.onipe: Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) [13:40:33] <_joe_> hashar: that would be more worrisome [13:41:01] I don't know anything about the database load balancer in mediawiki nowadays :-\ [13:41:10] at least it only happens from time to time and apparently only on php7.2 
[13:41:36] <_joe_> hashar: I don't see how this can be related to systems if it's specific to one version of mediawiki [13:41:37] (03CR) 10jerkins-bot: [V: 04-1] Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:41:45] <_joe_> we had no issue before, we have issues now [13:41:51] <_joe_> => bug in the code [13:42:06] <_joe_> like code running at the same time in the same server with the same db config [13:42:11] <_joe_> doesn't have the issue [13:42:13] RECOVERY - Check systemd state on ununpentium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:24] <_joe_> unless there is something wrong in the database definitions somewhere [13:42:46] _joe_: hypothetically a bad config could be unused on a previous version, but I agree with you [13:43:05] GROUP_GENERIC = '' seems right to me [13:43:25] but if that doesn't use the default weights, it is likely a problem with an update on the load balancer code [13:43:27] <_joe_> jynus: not /that/ config [13:43:39] (03PS6) 10Muehlenhoff: Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 [13:44:23] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10Aklapper) Yay, thanks everyone. :) SRE will need your email addresses to add you. (Note that Phab is public so feel free to obfuscate a bit, for spam bots.) 
[13:44:30] _joe_: I think in any case, reverting and then filing a system bug should be the way [13:44:40] if it was an infra issue [13:44:57] but if it fails on code deploy, it has to be researched first on software side [13:44:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] Ganeti: add netbox to Ganeti api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 (owner: 10Jbond) [13:45:09] and it only occurs on php7.2 for some reason [13:46:29] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:06] (03CR) 10Muehlenhoff: [C: 03+2] Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [13:51:40] 10Operations, 10netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (10elukey) @ayounsi I have sent two emails to their NOC but no answer, should we remove the peering config? [13:51:57] (03PS3) 10Mathew.onipe: Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) [13:52:49] 10Operations, 10netops: BGP session down for AS 20485 on cr2-esams - https://phabricator.wikimedia.org/T230004 (10elukey) 05Open→03Resolved ` elukey@re0.cr2-esams> show bgp summary | match 20485 80.249.210.177 20485 12254 7659 0 0 2d 14:25:53 Establ ` All good! [13:53:07] (03CR) 10jerkins-bot: [V: 04-1] Add SDQS module [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:54:13] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10StephaneKiwix) @Aklapper kelson and stephane, both @kiwix.org. Thanks! 
[13:54:24] (03PS1) 10Ottomata: Remove eventbus.discovery info [dns] - 10https://gerrit.wikimedia.org/r/535852 (https://phabricator.wikimedia.org/T232122) [13:55:09] (03PS1) 10KartikMistry: apertium-swe: New upstream release [debs/contenttranslation/apertium-swe] - 10https://gerrit.wikimedia.org/r/535853 (https://phabricator.wikimedia.org/T218184) [13:55:54] effie: _joe_ : jynus: there are only a few occurrences of that LBFactoryMulti undefined index, so I am letting it through. I guess Aaron will figure out later today ;] [13:55:58] (03PS1) 10Ottomata: Remove eventbus.discovery [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) [13:58:48] great [13:58:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove eventbus.discovery info [dns] - 10https://gerrit.wikimedia.org/r/535852 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:00:06] <_joe_> hashar: I would advise against letting it through [14:00:21] <_joe_> it's only going to get worse if we promote .22 to group2 [14:00:22] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10elukey) [14:00:28] (03CR) 10Alexandros Kosiaris: "block on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/535852/ but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:01:19] (03PS1) 10Jbond: puppetmaster1003: move dubnium, ores1001 & wtp1025 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535856 (https://phabricator.wikimedia.org/T228657) [14:02:13] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:02:49] _joe_: well I have made it a blocker [14:03:46] (03PS2) 10Jbond: puppetmaster1003: move dubnium, ores1001 & wtp1025 to new pmaster [puppet] - 
10https://gerrit.wikimedia.org/r/535856 (https://phabricator.wikimedia.org/T228657) [14:03:49] the empty string is the right default value for that var fwiw [14:04:35] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move dubnium, ores1001 & wtp1025 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535856 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:07:57] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) [14:12:47] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:13:16] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10jbond) p:05Triage→03Normal [14:13:46] 10Operations, 10Performance-Team, 10SRE-Access-Requests: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10jbond) p:05Triage→03Normal [14:14:45] (03PS1) 10Elukey: profile::mediawiki::webserver: remove hhvm restart cron when needed [puppet] - 10https://gerrit.wikimedia.org/r/535859 [14:16:33] (03PS1) 10Gilles: Gzip SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) [14:17:25] (03PS1) 10Muehlenhoff: Clean up puppetised puppetdb default file [puppet] - 10https://gerrit.wikimedia.org/r/535861 [14:21:19] (03PS1) 10Jbond: puppetmaster1003: move rdb1006, restbase1016 & scb1001 to pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535862 (https://phabricator.wikimedia.org/T228657) [14:21:27] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:22:40] (03CR) 10Muehlenhoff: "PCC: 
https://puppet-compiler.wmflabs.org/compiler1001/18241/" [puppet] - 10https://gerrit.wikimedia.org/r/535861 (owner: 10Muehlenhoff) [14:24:47] (03CR) 10Mforns: Add config for wmf_netflow to Turnilo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) (owner: 10Nuria) [14:25:56] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10Jclark-ctr) swapped drive slot 3 @Cmjohnson [14:26:15] (03PS2) 10Ottomata: Remove eventbus.discovery [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) [14:27:27] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move rdb1006, restbase1016 & scb1001 to pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535862 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:29:11] (03PS1) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) [14:29:26] (03CR) 10jerkins-bot: [V: 04-1] apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:30:45] (03PS2) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) [14:30:57] (03CR) 10jerkins-bot: [V: 04-1] apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [14:32:08] so train looks fine so far [14:32:17] taking a break [14:33:35] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:33:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] Remove LVS, discovery, and secondary monitoring of eventbus service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:37:07] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10akosiaris) @Jclark-ctr thanks. I can confirm that the array is being rebuilt! ` Enclosure Device ID: 15 Slot Number: 3 [snip] Firmware state: Rebuild [snip] ` [14:37:10] ACKNOWLEDGEMENT - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.008 second response time Cas Rusnov This is going away soon. https://wikitech.wikimedia.org/wiki/Netbox [14:37:25] ACKNOWLEDGEMENT - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov This is going away soon. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:49] (03PS1) 10Jbond: eventbus: class service::configuration has no ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/535865 [14:37:56] (03CR) 10CRusnov: "Thanks for this :)" [puppet] - 10https://gerrit.wikimedia.org/r/535841 (owner: 10Jbond) [14:38:27] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:56] (03PS4) 10Jbond: Ganeti: add netbox to Ganeti api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 [14:39:47] (03PS2) 10Jbond: eventbus: class service::configuration has no ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/535865 (https://phabricator.wikimedia.org/T232122) [14:40:55] ottomata: can you check ^^, currently causing alerts [14:41:16] (03CR) 10Jbond: [C: 03+2] Ganeti: add netbox to Ganeti api hosts [puppet] - 10https://gerrit.wikimedia.org/r/535841 (owner: 10Jbond) [14:41:50] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Ahecht) It's working again for me as well. [14:44:45] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:32] shush [14:48:34] (03PS1) 10Andrew Bogott: cloud cumin: add a second cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535866 (https://phabricator.wikimedia.org/T232429) [14:50:13] (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: add a second cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535866 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [14:54:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/535861 (owner: 10Muehlenhoff) [14:57:36] (03CR) 10Mobrovac: "Thnx!" [puppet] - 10https://gerrit.wikimedia.org/r/535848 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:59:08] 10Operations, 10netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (10ayounsi) Yep! I can walk you through it if needed. [15:01:08] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10ayounsi) I think it's safe to delete 28598 if they don't reply to your most recent email. About 12871 you're correct, or they're migrating something. Best is to ask them, then delete the down sessions if no r... [15:01:54] (03CR) 10Subramanya Sastry: "Hmm ... I started a new test run y'day evening ... and there are no "parsoid-tests" entries in logstash ... 
and these aren't in the mediaw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [15:02:08] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10jbond) p:05Triage→03Normal [15:03:01] !log downtimed dns-discovery confd health checks for eventbus - T232122 [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] T232122: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 [15:03:21] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 1415 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Netbox [15:03:47] (03CR) 10Ayounsi: [C: 03+2] "Thanks, I'll probably need to be walked through adding that test." [software/homer] - 10https://gerrit.wikimedia.org/r/535722 (owner: 10Ayounsi) [15:04:15] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10JHedden) a:03JHedden [15:04:24] (03CR) 10BBlack: [C: 03+2] Remove eventbus.discovery info [dns] - 10https://gerrit.wikimedia.org/r/535852 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:04:41] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (10RobH) a:05RobH→03faidon @Faidon, Please note that T227425 & T227288 are for spare pool allocations for kerbos in both codfw and eqiad. as such,... [15:05:29] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10RobH) a:05RobH→03faidon @Faidon, Please note that T227425 & T227288 are for spare pool allocations for kerbos in both codfw and eqiad. as such,... 
[15:07:36] (03PS3) 10BBlack: Remove eventbus.discovery [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:07:37] PROBLEM - netbox SSL on netmon1002 is CRITICAL: SSL CRITICAL - failed to verify netbox.wikimedia.org against librenms.wikimedia.org https://wikitech.wikimedia.org/wiki/Netbox [15:07:54] also shush [15:08:36] (03CR) 10BBlack: "Seems ok on authdns effects the compiler can see: https://puppet-compiler.wmflabs.org/compiler1002/18242/authdns1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:08:41] (03CR) 10BBlack: [C: 03+2] Remove eventbus.discovery [puppet] - 10https://gerrit.wikimedia.org/r/535855 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:08:59] (03CR) 10Ottomata: Remove LVS, discovery, and secondary monitoring of eventbus service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:09:29] (03CR) 10Ottomata: [C: 03+2] eventbus: class service::configuration has no ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/535865 (https://phabricator.wikimedia.org/T232122) (owner: 10Jbond) [15:09:34] (03PS3) 10Ottomata: eventbus: class service::configuration has no ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/535865 (https://phabricator.wikimedia.org/T232122) (owner: 10Jbond) [15:09:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventbus: class service::configuration has no ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/535865 (https://phabricator.wikimedia.org/T232122) (owner: 10Jbond) [15:10:41] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:11:21] RECOVERY - Check systemd state on netmon1002 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:26] (03CR) 10Ottomata: "FYI, I will not use this patch, as it does too much at once. Abandoning in favor of smaller patches following https://wikitech.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:11:32] (03Abandoned) 10Ottomata: Remove LVS, discovery, and secondary monitoring of eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:13:29] PROBLEM - Check the Netbox report-s- cables for fail status. on netmon1002 is CRITICAL: NRPE: Command check_check_netbox_cables not defined https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:15:13] (03PS1) 10Ottomata: Remove LVS/pybal config for eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535872 (https://phabricator.wikimedia.org/T232122) [15:17:03] (03Merged) 10jenkins-bot: Make the the devices.yaml config stanza optional [software/homer] - 10https://gerrit.wikimedia.org/r/535722 (owner: 10Ayounsi) [15:17:36] (03PS2) 10Ottomata: Remove LVS/pybal config for eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535872 (https://phabricator.wikimedia.org/T232122) [15:21:48] (03PS2) 10Nuria: Add config for wmf_netflow to Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) [15:23:11] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:20] (03PS1) 10Jbond: puppetmaster1003: move ms-be1016, ms-fe1005 & thumbor1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535873 (https://phabricator.wikimedia.org/T228657) [15:24:22] (03CR) 10Nuria: Add config for wmf_netflow to Turnilo (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535681 
(https://phabricator.wikimedia.org/T232226) (owner: 10Nuria) [15:24:56] (03PS2) 10Jbond: puppetmaster1003: move ms-be1016, ms-fe1005 & thumbor1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535873 (https://phabricator.wikimedia.org/T228657) [15:25:47] (03CR) 10Krinkle: "The statements in apache/sites/main.conf seem related. Not sure if it should be placed there and/or whether that would work." [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [15:25:50] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10bd808) [15:26:16] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move ms-be1016, ms-fe1005 & thumbor1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535873 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [15:27:34] (03CR) 10Gilles: "Oh yes, I was wondering if something like this was done anywhere else but I couldn't find it. 
It makes sense to group with main.conf, I wi" [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [15:28:41] 10Operations, 10Tools, 10cloud-services-team (Kanban): Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (10bd808) 05Open→03Resolved a:03bd808 https://gerrit.wikimedia.org/r/#/c/operations/docker-images/toollabs-images/+/527652/ [15:29:03] (03PS2) 10Gilles: Gzip SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) [15:29:53] (03PS2) 10Muehlenhoff: Clean up puppetised puppetdb default file [puppet] - 10https://gerrit.wikimedia.org/r/535861 [15:31:44] !log replacing fan kit and power supplies on cr2-codfw [15:32:24] (03CR) 10Muehlenhoff: [C: 03+2] Clean up puppetised puppetdb default file [puppet] - 10https://gerrit.wikimedia.org/r/535861 (owner: 10Muehlenhoff) [15:34:59] 10Operations, 10Cloud-VPS, 10Epic, 10IPv6, 10cloud-services-team (Kanban): Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947 (10bd808) [15:36:01] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:36:39] woot thanks jbond42 :) [15:38:06] 10Operations, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808) [15:38:40] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Harmonia_Amanda) It works for me too! Thank you! 
[15:38:50] (03PS1) 10Muehlenhoff: Fix distro check in puppetdb default file for JAVA_BIN [puppet] - 10https://gerrit.wikimedia.org/r/535877 [15:39:04] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Set up LVS for labs dns recursors - https://phabricator.wikimedia.org/T119660 (10Andrew) 05Open→03Declined I'm no longer clear that this is a good idea/necessary [15:39:44] (03CR) 10Jbond: netbox: Various remaining fixes. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535750 (owner: 10CRusnov) [15:40:28] (03PS6) 10Dzahn: gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 [15:41:32] (03PS3) 10BBlack: Remove LVS/pybal config for eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535872 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:42:36] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18243/" [puppet] - 10https://gerrit.wikimedia.org/r/535877 (owner: 10Muehlenhoff) [15:42:43] (03PS2) 10Muehlenhoff: Fix distro check in puppetdb default file for JAVA_BIN [puppet] - 10https://gerrit.wikimedia.org/r/535877 [15:43:43] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Andrew) I built a second cumin host, cloud-cumin-02.cloudinfra.eqiad.wmflabs. It's partly for backup, and partly beca... 
[15:44:14] 10Operations, 10Data-Services, 10decommission: Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10bd808) [15:44:42] (03CR) 10jenkins-bot: Make the the devices.yaml config stanza optional [software/homer] - 10https://gerrit.wikimedia.org/r/535722 (owner: 10Ayounsi) [15:45:27] (03CR) 10BBlack: [C: 03+2] "Seems sane in PCC: https://puppet-compiler.wmflabs.org/compiler1001/18244/lvs1016.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/535872 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:46:38] (03PS3) 10CRusnov: netbox: Various remaining fixes. [puppet] - 10https://gerrit.wikimedia.org/r/535750 [15:46:53] (03CR) 10CRusnov: netbox: Various remaining fixes. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535750 (owner: 10CRusnov) [15:47:23] (03CR) 10Muehlenhoff: [C: 04-1] "That's wrong: We should stop to use ldap::config::labs::ldapconfig in production entirely. Instead this should source the existing ldap: c" [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [15:47:37] (03PS2) 10Andrew Bogott: Labs cumin masters: Remove config associated with proxying via bastion [puppet] - 10https://gerrit.wikimedia.org/r/535733 (https://phabricator.wikimedia.org/T232429) (owner: 10Alex Monk) [15:48:17] !log lvs2006 - remove eventbus.svc.codfw.wmnet service, restart pybal, etc [15:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:10] !log lvs1016 - remove eventbus.svc.eqiad.wmnet service, restart pybal, etc [15:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:47] (03CR) 10Dzahn: [C: 03+2] gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [15:49:56] (03PS7) 10Dzahn: gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 [15:51:27] !log lvs2003 - remove eventbus.svc.codfw.wmnet 
service, restart pybal, etc [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:50] (03CR) 10Andrew Bogott: [C: 03+2] Labs cumin masters: Remove config associated with proxying via bastion [puppet] - 10https://gerrit.wikimedia.org/r/535733 (https://phabricator.wikimedia.org/T232429) (owner: 10Alex Monk) [15:52:42] !log lvs1015 - remove eventbus.svc.eqiad.wmnet service, restart pybal, etc [15:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/535750 (owner: 10CRusnov) [15:55:43] (03CR) 10CRusnov: [C: 03+2] netbox: Various remaining fixes. [puppet] - 10https://gerrit.wikimedia.org/r/535750 (owner: 10CRusnov) [15:55:54] (03PS4) 10CRusnov: netbox: Various remaining fixes. [puppet] - 10https://gerrit.wikimedia.org/r/535750 [15:57:07] (03PS8) 10Dzahn: gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 [15:57:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/535877 (owner: 10Muehlenhoff) [15:57:31] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Andrew) a:05Andrew→03Krenair I think this task is done but I'll let @krenair comment and close :) [15:57:52] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:59:31] (03PS1) 10Ottomata: Remove profile::lvs::realserver from profile::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/535882 (https://phabricator.wikimedia.org/T232122) [15:59:33] (03PS1) 10BBlack: Remove eventbus.svc DNS records [dns] - 10https://gerrit.wikimedia.org/r/535883 
(https://phabricator.wikimedia.org/T232122) [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T1600). [16:00:04] tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:23] I can SWAT today! [16:00:27] tgr: around? [16:00:36] o/ thanks Urbanecm! [16:01:18] tgr: +2'ed your backport [16:01:43] (03CR) 10Ottomata: [C: 03+2] Remove profile::lvs::realserver from profile::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/535882 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:01:49] (03PS2) 10Ottomata: Remove profile::lvs::realserver from profile::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/535882 (https://phabricator.wikimedia.org/T232122) [16:01:53] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove profile::lvs::realserver from profile::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/535882 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:02:06] (03PS2) 10Urbanecm: Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) [16:02:23] (03CR) 10BBlack: [C: 03+2] Remove eventbus.svc DNS records [dns] - 10https://gerrit.wikimedia.org/r/535883 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [16:03:08] (03CR) 10jerkins-bot: [V: 04-1] Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) (owner: 10Urbanecm) [16:03:47] (03PS3) 10Urbanecm: Add autopatrolled user group to az.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533625 (https://phabricator.wikimedia.org/T231493) 
(owner: 10DannyS712) [16:04:08] (03CR) 10Urbanecm: [C: 03+2] Add autopatrolled user group to az.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533625 (https://phabricator.wikimedia.org/T231493) (owner: 10DannyS712) [16:04:48] (03PS3) 10Urbanecm: Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) [16:05:11] (03Merged) 10jenkins-bot: Add autopatrolled user group to az.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533625 (https://phabricator.wikimedia.org/T231493) (owner: 10DannyS712) [16:05:58] (03PS9) 10Dzahn: gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 [16:06:43] (03CR) 10jenkins-bot: Add autopatrolled user group to az.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533625 (https://phabricator.wikimedia.org/T231493) (owner: 10DannyS712) [16:07:06] (03CR) 10Urbanecm: [C: 03+2] Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) (owner: 10Urbanecm) [16:07:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: eceaccf: Add autopatrolled user group to az.wikibooks (T231493) (duration: 01m 06s) [16:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:26] T231493: Add autopatroller user group to az.wikibooks - https://phabricator.wikimedia.org/T231493 [16:08:02] (03Merged) 10jenkins-bot: Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) (owner: 10Urbanecm) [16:08:04] (03PS1) 10Ottomata: Remove eventbus LVS hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/535884 (https://phabricator.wikimedia.org/T232122) [16:08:08] RECOVERY - Check the last execution of 
netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:08:51] (03CR) 10jenkins-bot: Add new whitelist rule for Université de Lorraine course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535831 (https://phabricator.wikimedia.org/T232596) (owner: 10Urbanecm) [16:10:30] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 510aa6b: Add new whitelist rule for Université de Lorraine course (T232596) (duration: 01m 04s) [16:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:34] T232596: Request for a temporary IP lift - https://phabricator.wikimedia.org/T232596 [16:11:03] (03CR) 10Dzahn: "merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/511614 instead. this is now a duplicate" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [16:11:08] (03PS3) 10Urbanecm: Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:11:13] paladox: you should now be able to change LDAP config for Gerrit in labs by setting it in project Hiera. noop in prod. 
also you can abandon the other change above [16:13:48] (03PS5) 10Urbanecm: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: 10Zoranzoki21) [16:13:54] (03CR) 10Urbanecm: [C: 03+2] Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: 10Zoranzoki21) [16:14:35] (03PS2) 10Dzahn: DNS: Remove DNS mgmt asset tag WMF6403 [dns] - 10https://gerrit.wikimedia.org/r/535211 (https://phabricator.wikimedia.org/T200210) (owner: 10Papaul) [16:15:40] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Netbox [16:15:41] (03CR) 10Dzahn: [C: 03+2] DNS: Remove DNS mgmt asset tag WMF6403 [dns] - 10https://gerrit.wikimedia.org/r/535211 (https://phabricator.wikimedia.org/T200210) (owner: 10Papaul) [16:15:42] tgr: patch is merged [16:15:49] mutante: thanks!! [16:16:24] tgr: you can test on mwdebug1002 [16:16:29] let me know if it looks good tgr [16:16:45] paladox: remember you gotta copy the whole ldap_config part, not just server name. 
but that also means you are even more flexible [16:16:47] !log bblack@cumin1001 conftool action : set/pooled=no; selector: cluster=eventbus [16:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:09] (03CR) 10BBlack: [C: 03+1] Remove eventbus LVS hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/535884 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:18:17] (03CR) 10Urbanecm: [C: 03+2] Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:18:31] Ok [16:18:41] (03PS2) 10Ottomata: Remove eventbus LVS hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/535884 (https://phabricator.wikimedia.org/T232122) [16:18:54] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove eventbus LVS hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/535884 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:18:58] paladox: also please abandon the duplicate [16:23:02] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: 10Zoranzoki21) [16:23:02] Urbanecm: hm, it still looks broken [16:23:16] this should have been enough time for ResourceLoader to catch up [16:23:21] (03CR) 10jenkins-bot: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: 10Zoranzoki21) [16:23:37] tgr: let me ensure if I did everything correctly [16:24:21] !log bootstrapping Cassandra, restbase-dev1005-a -- T224554 [16:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:25] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [16:25:19] Urbanecm: duh, sorry, I was looking on a .21 wiki [16:25:27] tgr: the code 
seems to be on mwdebug1002 for .22 [16:25:32] ah, that explains it :) [16:25:39] only testwiki should be on .22, right? [16:25:45] (from Growth target wikis) [16:25:45] do we have anything in group 0/1 that has homepages enabled? [16:25:52] right, testwiki [16:26:25] (03CR) 10Dzahn: [C: 03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/535211 (https://phabricator.wikimedia.org/T200210) (owner: 10Papaul) [16:27:19] Urbanecm: looks good, thanks! [16:27:37] cool, will sync soon tgr ! [16:28:04] (03PS4) 10Urbanecm: Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:28:14] (03CR) 10Urbanecm: [C: 03+2] Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:28:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 565fafa: Set noindex for user and user_talk on zhwiki (T231982) (duration: 01m 05s) [16:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:28] T231982: NOINDEX userpages within the Chinese Wikipedia - https://phabricator.wikimedia.org/T231982 [16:29:43] (03Merged) 10jenkins-bot: Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:29:59] (03CR) 10jenkins-bot: Remove OTRS-member usergroup from fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535749 (https://phabricator.wikimedia.org/T232554) (owner: 104nn1l2) [16:30:13] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10Papaul) [16:30:40] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit 
netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:31:05] 10Operations, 10ops-eqiad, 10decommission, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Papaul) [16:31:08] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10Papaul) 05Open→03Resolved The server was used to replace the old mw2232 see T232126 [16:31:10] 10Operations, 10ops-eqiad: helium array has slot 3 disk failed - https://phabricator.wikimedia.org/T232591 (10wiki_willy) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr [16:31:17] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/GrowthExperiments/modules/homepage/ext.growthExperiments.StartModule.less: SWAT: c45d6d0: Homepage: Fix start module layout bugs (T230629, T232549, T225668) (duration: 01m 03s) [16:31:20] tgr: synced [16:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:22] T232549: Start module rendering broken in Safari 10 on desktop - https://phabricator.wikimedia.org/T232549 [16:31:23] T225668: Homepage: start module layout on iOS Safari 10 and iOS Chrome 60.0.3112.89 - https://phabricator.wikimedia.org/T225668 [16:31:23] T230629: Start module (overlay and server-side rendered view) layout broken in Firefox - https://phabricator.wikimedia.org/T230629 [16:32:14] !log mwscript importImages.php --wiki=commonswiki --user=Abbe98 --comment-ext=txt /home/urbanecm/T232346 [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:47] hm, I wonder why this fix was only deployed to .22 [16:33:58] maybe I misunderstood something? [16:34:41] tgr: because you only backported it to .22 [16:34:48] do you want to have it backported it also to .21? 
[16:35:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 76991f2: Remove OTRS-member usergroup from fawiki (T232554) (duration: 01m 05s) [16:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:25] T232554: Remove OTRS-member usergroup from the Persian Wikipedia - https://phabricator.wikimedia.org/T232554 [16:35:51] tgr: if you want, I can backport the patch to the other active branch as well :-). [16:35:53] I wasn't the one who backported it and don't know much about the context of the bug, but it definitely looks broken on the 'pedias [16:36:12] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10RobH) [16:36:17] it probably will be, since the patch was merged today [16:36:21] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10RobH) [16:36:23] let's see if it backports cleanly [16:36:24] (I meant, you scheduled it) [16:36:50] !log ran conftool-merge on puppetmaster1001 (manually from sudo -i, to fixup missing updates) [16:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:11] but well, let's do https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/535887 as well tgr :) [16:38:26] (03Abandoned) 10Dzahn: Gerrit: Support switching ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [16:38:35] !log Run mwscript emptyUserGroup.php --wiki=fawiki OTRS-member (T232554) [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:16] (03PS2) 10Urbanecm: [rowiki] Allow sysops to remove patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533029 (https://phabricator.wikimedia.org/T231099) (owner: 10Strainu) [16:39:28] !log decommissioning Cassandra, restbase1018-a -- T224553 [16:39:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:30] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [16:39:40] (03CR) 10Urbanecm: [C: 03+2] [rowiki] Allow sysops to remove patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533029 (https://phabricator.wikimedia.org/T231099) (owner: 10Strainu) [16:40:26] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [16:40:38] (03PS1) 10BBlack: calico: remove eventbus fw rules [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) [16:40:51] (03Merged) 10jenkins-bot: [rowiki] Allow sysops to remove patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533029 (https://phabricator.wikimedia.org/T231099) (owner: 10Strainu) [16:41:13] (03CR) 10BBlack: "IIRC, the merge for this is non-standard off in another repo or branch somewhere..." [puppet] - 10https://gerrit.wikimedia.org/r/535889 (https://phabricator.wikimedia.org/T232122) (owner: 10BBlack) [16:41:20] (03CR) 10jenkins-bot: [rowiki] Allow sysops to remove patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533029 (https://phabricator.wikimedia.org/T231099) (owner: 10Strainu) [16:42:12] Urbanecm: none of the changes between .21 and .22 seem like they could interfere, so let's do it [16:42:27] tgr: okay. 
I've already +2'ed, let's wait for CI [16:42:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 6007fbc: [rowiki] Allow sysops to remove patrollers (T231099) (duration: 01m 03s) [16:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:39] T231099: Patroller rights changes for ro.wp - https://phabricator.wikimedia.org/T231099 [16:43:38] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10Aklapper) @MBeat33: Could you maybe answer @BBlack's last question? Or do you know who could? (Asking you because of https://phabricator.wikimedia.org/T228672#5358426 ) [16:44:32] (03PS3) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/535863 (https://phabricator.wikimedia.org/T218184) [16:44:51] 10Operations, 10ops-codfw: ganeti2005 - mgmt interface stopped responding and reset fails - https://phabricator.wikimedia.org/T232067 (10Dzahn) 05Open→03Invalid For some reason this self-healed and is working again without further action. [16:47:48] Urbanecm: done? [16:48:00] Krinkle: waiting on CI [16:48:02] ok [16:49:13] 10Operations, 10MediaWiki-Maintenance-scripts, 10media-storage: Server side upload failed with "overwriting failed (at recordUpload stage)" - https://phabricator.wikimedia.org/T231738 (10Urbanecm) 05Open→03Invalid Then it's probably some temporary issue. I'll let you know if it re-appears. 
[16:49:17] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:49:30] !log manually removed decommed eventbus LVS IP on kafka-main2001 [16:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:57] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:50:43] !log manually removed decommed eventbus LVS IP on kafka-main200[23] [16:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:15] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:53:11] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 2.780 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:53:20] known i'm debugging a thing [16:54:02] !log manually removed decommed eventbus LVS IP on kafka-main1001 [16:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:15] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 368 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:54:16] (03PS1) 10RobH: adding in new dell skus [software] - 10https://gerrit.wikimedia.org/r/535891 [16:54:39] !log manually removed decommed eventbus LVS IP on kafka100[23] [16:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:43] Urbanecm: it's in [16:56:02] (03CR) 10RobH: [C: 03+2] update to dell skus [software] - 10https://gerrit.wikimedia.org/r/530597 (owner: 10RobH) [16:56:09] yup, syncing [16:56:13] (03CR) 10RobH: [C: 
03+2] adding in new dell skus [software] - 10https://gerrit.wikimedia.org/r/535891 (owner: 10RobH) [16:56:59] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/GrowthExperiments/modules/homepage/ext.growthExperiments.StartModule.less: SWAT: c0fd061: Homepage: Fix start module layout bugs (T230629, T232549, T225668) (duration: 01m 02s) [16:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:07] T232549: Start module rendering broken in Safari 10 on desktop - https://phabricator.wikimedia.org/T232549 [16:57:08] T230629: Start module (overlay and server-side rendered view) layout broken in Firefox - https://phabricator.wikimedia.org/T230629 [16:57:08] T225668: Homepage: start module layout on iOS Safari 10 and iOS Chrome 60.0.3112.89 - https://phabricator.wikimedia.org/T225668 [16:57:25] tgr: synced [16:58:20] thanks Urbanecm! looks good in Firefox [16:58:28] good! [16:59:35] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:02:48] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10BBlack) 05Resolved→03Open Re-open as this isn't really complete yet, the battery came in and replacement is proceeding. Since @jijiki did this before and claims it's just a depool command, w... 
[17:02:53] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:05:11] !log restbase2009 - depool for hardware work - T227408 [17:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:14] T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 [17:07:20] !log restbase2009 - shutdown for hardware work - T227408 [17:07:21] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Netbox [17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:04] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10Dzahn) T228672 says nobody in charge of the Shop is even on Phabricator :( Looks like we have to email merchandise@ to get this bumped. [17:09:21] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (10greg) Thanks @jbond! 
[17:09:37] 10Operations, 10FR-Q2-FY2019-20-cleanup-list, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10DStrine) [17:10:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:10:41] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a response wa [17:10:41] //wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:05] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:05] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:11] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10BBlack) That's kind of ridiculous... 
[17:11:15] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a respons [17:11:15] tps://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:15] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:17] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:23] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:45] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 1.777 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:11:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/RESTBase [17:12:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:13:17] RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:13:23] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:13:43] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:14:17] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:14:53] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:14:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:45] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:16:05] <_joe_> uhm [17:16:50] ulsfo is depooled [17:16:59] if you're looking at that part [17:17:17] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:25] RECOVERY - restbase endpoints 
health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:27] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:33] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:48] (03PS1) 10CRusnov: profile::puppetdb: Fix ferm rule for netbox frontends [puppet] - 10https://gerrit.wikimedia.org/r/535894 [17:17:51] (03PS1) 10Dzahn: Add CSP headers for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/535895 (https://phabricator.wikimedia.org/T213223) [17:18:01] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:18:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:19:46] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) Smart storage replacement complete. 
Embedded HPE Smart Storage Battery 875241-B21 878643-001 6WQXL0BB2BQ4H8 01 0.60 OK [17:20:09] (CR) Dzahn: [C: +2] Add CSP headers for doc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/535895 (https://phabricator.wikimedia.org/T213223) (owner: Dzahn) [17:21:10] (PS2) CRusnov: profile::puppetdb: Fix ferm rule for netbox frontends [puppet] - https://gerrit.wikimedia.org/r/535894 [17:22:46] (CR) CRusnov: "compiler looks good https://puppet-compiler.wmflabs.org/compiler1001/18246/puppetdb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/535894 (owner: CRusnov) [17:22:56] (PS3) CRusnov: profile::puppetdb: Fix ferm rule for netbox frontends [puppet] - https://gerrit.wikimedia.org/r/535894 [17:23:55] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:08] (CR) CRusnov: [C: +2] profile::puppetdb: Fix ferm rule for netbox frontends [puppet] - https://gerrit.wikimedia.org/r/535894 (owner: CRusnov) [17:27:32] !log restbase2009 - re-pool - T227408 [17:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:36] T227408: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 [17:28:58] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Netbox [17:30:03] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:30:11] duzzah [17:30:13] huzzah [17:31:43] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:31:43] PROBLEM - Check the Netbox report coherence for fail status.
on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:32:43] !log enable GRE MTU mitigation on eqsin caches (cp5xxx) - T232602 [17:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:47] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:32:48] T232602: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 [17:33:12] 10Operations, 10Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10BBlack) [17:34:44] (03CR) 10Krinkle: [C: 03+2] logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [17:34:57] (03CR) 10jerkins-bot: [V: 04-1] logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [17:35:46] * Krinkle staging on mwdebug1002 [17:36:16] (03PS2) 10Krinkle: logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) [17:36:22] (03PS2) 10Krinkle: logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) [17:36:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:18] 10Operations, 10MediaWiki-General, 10serviceops, 10CPT Initiatives (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - 
https://phabricator.wikimedia.org/T219279 (10WDoranWMF) [17:42:51] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:42:58] !log ayounsi@deploy1001 Started deploy [librenms/librenms@2a06e98]: Upgrade LibreNMS to 1.55 - T232599 [17:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:07] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2a06e98]: Upgrade LibreNMS to 1.55 - T232599 (duration: 00m 09s) [17:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:21] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:46:15] RECOVERY - Host ps1-b6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [17:47:01] PROBLEM - ps1-b6-eqiad-infeed-load-tower-A-phase-Y on ps1-b6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:01] PROBLEM - ps1-b6-eqiad-infeed-load-tower-B-phase-Y on ps1-b6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:21] PROBLEM - ps1-b6-eqiad-infeed-load-tower-A-phase-Z on ps1-b6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:21] PROBLEM - ps1-b6-eqiad-infeed-load-tower-B-phase-Z on ps1-b6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:21] PROBLEM - ps1-b6-eqiad-infeed-load-tower-B-phase-X on ps1-b6-eqiad is CRITICAL: CRITICAL - 
Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:28] PROBLEM - ps1-b6-eqiad-infeed-load-tower-A-phase-X on ps1-b6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:48:09] PROBLEM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:49:59] (03CR) 10Krinkle: [C: 03+2] logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [17:54:00] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10thcipriani) [17:57:47] !log upgrade librenms to 1.55 [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:09] 04Critical [17:59:10] uh? just Critical? :) [17:59:11] (03Merged) 10jenkins-bot: logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [17:59:37] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) 05Open→03Resolved This can be resolved [18:00:13] (03CR) 10jenkins-bot: logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [18:02:44] vgutierrez: looks related to the upgrade. 
it should say "i am being upgraded and dont know why i'm critical" or so [18:02:46] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/WikimediaMaintenance/blameStartupRegistry.php: (no justification provided) (duration: 01m 05s) [18:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:09] 04̶C̶r̶i̶t̶i̶c̶a̶l [18:03:37] well that's interesting :) [18:04:04] content-free critical == librenms itself? :) [18:05:04] yep, strike through could be green instead [18:07:24] 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (10Papaul) [18:07:36] 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (10Papaul) 05Open→03Resolved Complete [18:11:08] (03PS2) 10Papaul: DNS: Change asset tag DNS for mw2231 [dns] - 10https://gerrit.wikimedia.org/r/535212 (https://phabricator.wikimedia.org/T231192) [18:14:53] PROBLEM - Check the Netbox report coherence for fail status. 
on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:14:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:14:55] * Krinkle staging on mwdebug1002 [18:15:19] !log nuria@deploy1001 Started deploy [analytics/refinery@f4c60a4]: v0.0.99 of refinery [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:23] (03PS3) 10Dzahn: install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) [18:16:40] !log nuria@deploy1001 Finished deploy [analytics/refinery@f4c60a4]: v0.0.99 of refinery (duration: 01m 21s) [18:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: d6865e3365e8 - T211124 (duration: 01m 04s) [18:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:55] T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 [18:18:40] (03CR) 10Gehel: [C: 04-1] "Instead of duplicating the module, we should extract the common functionalities. Or just rename the wdqs module to "query_service" and add" [puppet] - 10https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [18:18:55] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10MBeat33) Hi all, @Jseddon is on leave, but this is on his agenda for when he returns. 
I know he's engaged with Shopify about this issue. [18:20:34] (CR) Krinkle: [C: +2] logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) (owner: Krinkle) [18:22:21] (Merged) jenkins-bot: logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) (owner: Krinkle) [18:22:36] (CR) jenkins-bot: logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) (owner: Krinkle) [18:25:02] PROBLEM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:26:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:28:35] (03CR) 10Dzahn: [C: 03+2] install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [18:29:18] (03CR) 10Jforrester: Turn InitialiseSettings into a static array return for testability (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [18:31:46] (03CR) 10Dzahn: [C: 03+2] DNS: Change asset tag DNS for mw2231 [dns] - 10https://gerrit.wikimedia.org/r/535212 (https://phabricator.wikimedia.org/T231192) (owner: 10Papaul) [18:31:59] (03PS3) 10Dzahn: DNS: Change asset tag DNS for mw2231 [dns] - 10https://gerrit.wikimedia.org/r/535212 (https://phabricator.wikimedia.org/T231192) (owner: 10Papaul) [18:33:24] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10BBlack) @MBeat33 + @Jseddon - Thank you for the update(s) [18:33:42] !log nuria@deploy1001 Started deploy [analytics/refinery@fa994c7]: v0.0.99 of refinery, again, try II. last time shas commited by jenkins were incorrect [18:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:09] !log krinkle@deploy1001 Synchronized tests/: no-op ed8dd7aad9e5 (duration: 01m 05s) [18:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:34] 10Operations, 10User-fgiunchedi, 10Wikimedia-production-error: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10Krinkle) Still seen regularly. It's somewhat concerning me because we rely on mwdebug1002 to... 
[18:39:23] 10Operations, 10serviceops, 10User-fgiunchedi, 10Wikimedia-production-error: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10Krinkle) [18:39:48] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:40] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: no-op ed8dd7aad9e5 (duration: 01m 06s) [18:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:08] (03CR) 10Krinkle: Turn InitialiseSettings into a static array return for testability (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [18:42:21] !log nuria@deploy1001 Finished deploy [analytics/refinery@fa994c7]: v0.0.99 of refinery, again, try II. last time shas commited by jenkins were incorrect (duration: 08m 39s) [18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:25] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T211124 ed8dd7aad9e5 (duration: 01m 04s) [18:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:27] T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 [18:43:47] !log decommissioning Cassandra, restbase1018-b -- T224553 [18:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:50] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [18:48:40] (03PS1) 10Nuria: Bumping up jar version to correct typo [puppet] - 10https://gerrit.wikimedia.org/r/535910 [18:54:02] made a Ganeti VM and console stays blank before and after a reboot. though status was shown as UP... 
[18:57:10] Operations, hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (Andrew) [19:07:14] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:11:33] hmm.. even deleting (remove) the VM seems to hang [19:37:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:06] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:42:07] PROBLEM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:47:37] Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (Papaul) p:Triage→Normal [19:48:15] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (srishakatux) [19:48:27] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (srishakatux) p:Triage→Normal [19:49:46] Operations, ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (Papaul) [19:50:28] Operations, ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (Papaul) [19:50:36] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (srishakatux)
@Bmueller your help needed in approving my request! [19:52:27] (PS1) Ottomata: Remove eventbus graphite threshold alert [puppet] - https://gerrit.wikimedia.org/r/535920 (https://phabricator.wikimedia.org/T232122) [19:52:29] (PS1) Ottomata: Ensure eventlogging-service-eventbus is absent [puppet] - https://gerrit.wikimedia.org/r/535921 (https://phabricator.wikimedia.org/T232122) [19:54:59] (CR) Ottomata: [C: +2] Remove eventbus graphite threshold alert [puppet] - https://gerrit.wikimedia.org/r/535920 (https://phabricator.wikimedia.org/T232122) (owner: Ottomata) [19:59:07] (CR) Ottomata: [C: +2] Ensure eventlogging-service-eventbus is absent [puppet] - https://gerrit.wikimedia.org/r/535921 (https://phabricator.wikimedia.org/T232122) (owner: Ottomata) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T2000).
[20:00:17] no parsoid deploy today [20:01:44] (PS1) Ottomata: profile::eventbus - Declare kafka $config in scope [puppet] - https://gerrit.wikimedia.org/r/535922 (https://phabricator.wikimedia.org/T232122) [20:02:00] (CR) Ottomata: [V: +2 C: +2] profile::eventbus - Declare kafka $config in scope [puppet] - https://gerrit.wikimedia.org/r/535922 (https://phabricator.wikimedia.org/T232122) (owner: Ottomata) [20:02:17] (PS1) Mforns: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) [20:04:24] (CR) jerkins-bot: [V: -1] analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: Mforns) [20:06:21] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@2c9e409]: Clean up old event style support T230049 [20:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:26] T230049: Delayed jobs fail validation in eventgate - https://phabricator.wikimedia.org/T230049 [20:07:14] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@2c9e409]: Clean up old event style support T230049 (duration: 00m 53s) [20:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:10] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@522177f]: Clean up old event style support [20:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:09] (PS2) Mforns: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) [20:12:49] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@522177f]: Clean up old event style support (duration: 01m 39s) [20:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:56] (PS1) Dzahn: install_server: update MAC for moscovium [puppet] - https://gerrit.wikimedia.org/r/535926 [20:16:46] (CR) Dzahn: [C: +2] install_server: update MAC for moscovium [puppet] - https://gerrit.wikimedia.org/r/535926 (owner: Dzahn) [20:18:41] (CR) Mforns: [C: -1] "I think this looks good now, but..." [puppet] - https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: Mforns) [20:18:45] Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (Krenair) Open→Resolved a:Krenair→Andrew with the merging of https://gerrit.wikimedia.org/r/535727 I th... [20:18:47] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Krenair) [20:21:00] (PS2) Dzahn: install_server: update MAC for moscovium [puppet] - https://gerrit.wikimedia.org/r/535926 [20:21:18] !log stopped and removed eventlogging-service-eventbus - T232122 [20:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:20] T232122: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 [20:27:05] (PS1) Dzahn: use performance.discovery instead of webperf.discovery [dns] - https://gerrit.wikimedia.org/r/535927 [20:29:18] (PS2) Dzahn: use performance.discovery instead of webperf.discovery [dns] - https://gerrit.wikimedia.org/r/535927 [20:29:37] (PS1) Dzahn: ATS: switch webperf backends to TLS and discovery name [puppet] - https://gerrit.wikimedia.org/r/535929 (https://phabricator.wikimedia.org/T210411) [20:32:37] (CR) Ottomata: [C: +1] "I don't even know what this does but +1 :p" [puppet] - https://gerrit.wikimedia.org/r/535889
(https://phabricator.wikimedia.org/T232122) (owner: BBlack) [20:39:26] (CR) Dzahn: [C: +2] use performance.discovery instead of webperf.discovery [dns] - https://gerrit.wikimedia.org/r/535927 (owner: Dzahn) [20:39:40] (PS1) Andrew Bogott: boostrapvz: disable systemd-timesyncd during first boot [puppet] - https://gerrit.wikimedia.org/r/535931 [20:39:43] jenkins-bot became slower to V+2 on DNS repo it feels [20:39:56] as in > 12 minutes [20:57:32] !log bootstrapping Cassandra, restbase-dev1005-b -- T224554 [20:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:37] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [21:06:59] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/535704 (owner: Jforrester) [21:07:55] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) (owner: Krinkle) [21:08:26] (CR) jerkins-bot: [V: -1] Commit results of `composer update` [mediawiki-config] - https://gerrit.wikimedia.org/r/535704 (owner: Jforrester) [21:11:35] Operations, ORES, serviceops, Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) @Dzahn, what would it take to implement a check like the one I described above? I... [21:11:42] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:12:05] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/535703 (owner: Jforrester) [21:13:33] (CR) jerkins-bot: [V: -1] Commit results of `composer update --no-dev` [mediawiki-config] - https://gerrit.wikimedia.org/r/535703 (owner: Jforrester) [21:18:14] Operations, ORES, Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (Halfak) p:High→Normal [21:20:34] Puppet, ORES, Scoring-platform-team, Patch-For-Review: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (Halfak) p:High→Low [21:21:05] Operations, ORES, Scoring-platform-team, Performance: Stress test ORES/kubernetes (above 4.5k scores/second) - https://phabricator.wikimedia.org/T214054 (Halfak) Open→Stalled p:High→Low [21:21:07] Operations, ORES, Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (Halfak) [21:26:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:26:38] PROBLEM - Check the Netbox report coherence for fail status.
on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:27:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:31:26] (PS1) Dzahn: ATS: switch releases-jenkins to TLS [puppet] - https://gerrit.wikimedia.org/r/535936 (https://phabricator.wikimedia.org/T210411) [21:43:50] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:05] mumble [21:46:15] (PS1) Dzahn: add fake TLS key for mwmaint [labs/private] - https://gerrit.wikimedia.org/r/535939 [21:46:37] (CR) Dzahn: [V: +2 C: +2] add fake TLS key for mwmaint [labs/private] - https://gerrit.wikimedia.org/r/535939 (owner: Dzahn) [21:52:23] (PS1) Dzahn: add certificate for mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/535941 (https://phabricator.wikimedia.org/T210411) [21:59:00] (PS1) Ayounsi: Add role netinsight to netflow2001 [puppet] - https://gerrit.wikimedia.org/r/535942 (https://phabricator.wikimedia.org/T226810) [22:01:54] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:04:06] Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria)
[22:04:40] Operations, Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (Paladox)
[22:06:25] Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria) This has the effect that these images are being considered content pageviews when they are just asset requests
[22:13:09] (PS1) 4nn1l2: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - https://gerrit.wikimedia.org/r/535944
[22:13:17] (PS1) Ayounsi: LibreNMS: fix new deployments permissions errors [puppet] - https://gerrit.wikimedia.org/r/535945
[22:16:21] (PS2) 4nn1l2: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657)
[22:18:54] Operations: Make contact group for Netbox report alerts - https://phabricator.wikimedia.org/T230725 (Dzahn) The existing bot, icinga-wm, can be used. What is needed is: - create 2 custom notification commands (modules/nagios_common/templates/notification_commands.cfg.erb), notify-service-by-irc-dcops and n...
[22:30:26] !log decommissioning Cassandra, restbase1018-c -- T224553
[22:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:30] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[22:30:58] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:34:22] Operations, Analytics, Traffic: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria) I think we need to add proxy=googleweblight to x-analytics
[22:36:42] !log add BGP session between cr2-eqord and netflow1001
[22:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:56] (PS6) Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665
[22:38:17] (CR) Jforrester: Turn InitialiseSettings into a static array return for testability (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[22:40:20] Operations: Make contact group for Netbox report alerts - https://phabricator.wikimedia.org/T230725 (Dzahn) Added this on Wikitech because it had no docs.
https://wikitech.wikimedia.org/wiki/Icinga#IRC_bot
[22:41:34] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:41:47] Operations, observability: Make contact group for Netbox report alerts - https://phabricator.wikimedia.org/T230725 (Dzahn)
[22:41:51] (CR) jerkins-bot: [V: -1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[22:42:16] (CR) Dzahn: [C: +2] add certificate for mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/535941 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[22:43:08] !log `set protocols bgp group Netflow cluster 208.80.154.196` on cr1-eqiad
[22:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:39] (PS2) Dzahn: ATS: switch releases-jenkins to TLS [puppet] - https://gerrit.wikimedia.org/r/535936 (https://phabricator.wikimedia.org/T210411)
[22:48:10] Warning
[22:48:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:57] (CR) Dzahn: [C: +2] ATS: switch releases-jenkins to TLS [puppet] - https://gerrit.wikimedia.org/r/535936 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190911T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:03:55] (PS7) Krinkle: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[23:04:48] (CR) Krinkle: Turn InitialiseSettings into a static array return for testability (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[23:06:23] (CR) jerkins-bot: [V: -1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[23:10:11] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Aklapper) @srishakatux: What's your LDAP user name? See https://wikitech.wikimedia.org/wiki/Production_shell_access#...
[23:11:14] librenms-wmf: thanks for this useful message
[23:14:18] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Dzahn) Adding Nuria as requested on Analytics access requests.
[23:18:56] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) p:Low→Normal
[23:19:06] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) Stalled→Open
[23:20:09] !log `set protocols bgp group Netflow cluster 208.80.154.197` on cr2-eqiad
[23:20:10] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Paladox)
[23:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:13] Operations, Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (Paladox)
[23:33:15] (PS4) Jforrester: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: Luca Mauri)
[23:34:44] (PS2) Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - https://gerrit.wikimedia.org/r/535703
[23:34:46] (PS2) Jforrester: Commit results of `composer update` [mediawiki-config] - https://gerrit.wikimedia.org/r/535704
[23:34:48] (PS2) Jforrester: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/535707
[23:34:50] (PS3) Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/535705
[23:34:52] (PS3) Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/535706
[23:34:54] (PS8) Jforrester: Turn InitialiseSettings into a static array return
for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665
[23:36:58] (PS1) Dzahn: acme_chief: add gerrit1001 as authorized host for gerrit certs [puppet] - https://gerrit.wikimedia.org/r/535962 (https://phabricator.wikimedia.org/T222391)
[23:37:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:02] (CR) Paladox: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/535962 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:39:39] (PS7) Jforrester: Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602)
[23:39:41] (PS4) Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535257
[23:39:43] (PS1) Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535963
[23:42:03] Operations, Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (Dzahn) The additional steps needed before this can be switched to active are tracked in the parent task T222391.
[23:42:19] (PS1) Dzahn: gerrit: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391)
[23:43:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:52] (CR) jerkins-bot: [V: -1] Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: Luca Mauri)
[23:44:06] (CR) Paladox: gerrit: allow connections from gerrit1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:44:50] (CR) jerkins-bot: [V: -1] Commit results of `composer update --no-dev` [mediawiki-config] - https://gerrit.wikimedia.org/r/535703 (owner: Jforrester)
[23:44:59] (CR) jerkins-bot: [V: -1] Commit results of `composer update` [mediawiki-config] - https://gerrit.wikimedia.org/r/535704 (owner: Jforrester)
[23:45:03] (PS1) Dzahn: ci: allow ssh to new gerrit server gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535965 (https://phabricator.wikimedia.org/T222391)
[23:45:10] (CR) Dzahn: gerrit: allow connections from gerrit1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:45:31] (CR) jerkins-bot: [V: -1] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/535707 (owner: Jforrester)
[23:45:53] (CR) jerkins-bot: [V: -1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/535705 (owner: Jforrester)
[23:46:11] (CR) jerkins-bot: [V: -1] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/535706 (owner: Jforrester)
[23:46:38] (CR) jerkins-bot: [V: -1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - https://gerrit.wikimedia.org/r/535665 (owner: Jforrester)
[23:46:50] Operations, Traffic,
Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (Krinkle)
[23:46:58] (PS1) Dzahn: mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391)
[23:47:02] (CR) jerkins-bot: [V: -1] Variant configuration: Read from JSON, not serialised PHP, for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:47:13] (PS3) Cwhite: add the option of passing a custom metrics context manager to EndpointRequest [software/service-checker] - https://gerrit.wikimedia.org/r/532807
[23:47:32] (CR) jerkins-bot: [V: -1] Variant configuration: Write JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535257 (owner: Jforrester)
[23:47:36] (CR) jerkins-bot: [V: -1] Variant configuration: Read JSON config for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/535963 (owner: Jforrester)
[23:47:54] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:47:58] Operations, Traffic, Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (Krinkle)
[23:48:23] Operations, Traffic, Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (Krinkle)
[23:49:53] Operations, Traffic, Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (Krinkle) >>!
At T232252, @Agusbou2015 wrote: > > I click on "Publish changes" (on any Wikimedia project) and changes are not saved. > Ste...
[23:51:43] (PS1) Dzahn: smokeping: replace cobalt with gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535969 (https://phabricator.wikimedia.org/T222391)
[23:51:45] (PS2) Paladox: gerrit: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:51:49] (CR) Paladox: [C: +1] gerrit: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:51:51] (CR) Paladox: gerrit: allow connections from gerrit1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/535964 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[23:53:11] (CR) jerkins-bot: [V: -1] add the option of passing a custom metrics context manager to EndpointRequest [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[23:53:24] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:53:40] (PS2) Dzahn: smokeping: replace cobalt with gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535969 (https://phabricator.wikimedia.org/T222391)
[23:54:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:57:24] (PS1) Dzahn: gerrit: add gerrit1001 to SSH known_hosts file [puppet] - https://gerrit.wikimedia.org/r/535971 (https://phabricator.wikimedia.org/T222391)
[23:59:23] (PS4) Cwhite: add generic interface to metrics gathering [software/service-checker] - https://gerrit.wikimedia.org/r/532807