[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:19:15] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T240917 (10SNowick_WMF) [00:26:57] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [00:38:48] (03CR) 10Bstorm: cloud: update maintain-views to handle dblists with comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [00:42:22] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (10SNowick_WMF) [00:48:16] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10kzimmerman) @jcrespo I'm going to remove `analytics-wmde-users` (and yes, it should not have been `analytics-wmde`) from my reques... [00:48:35] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10kzimmerman) [00:49:10] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (10kzimmerman) Approved as Shay's manager! 
[00:51:03] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10RobH) Please note that I've created a google sheet that lists off every single device in the racks, and their power cable mappings. I cannot link it here, as it is currently shared... [00:54:11] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10RobH) Ok, request submitted. Iron mountain has a few steps, first is a request ID SCTASK0137107. Next they'll convert my request into a different ticket # (for work order iirc) an... [00:59:09] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1138.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:12:00] (03PS1) 10Andrew Bogott: codfw1dev nova: add cloudvirt2002-dev to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/558275 [02:13:52] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev nova: add cloudvirt2002-dev to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/558275 (owner: 10Andrew Bogott) [02:38:51] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:44:02] (03PS1) 10Andrew Bogott: Revert "codfw1dev nova: add cloudvirt2002-dev to the scheduling pool" [puppet] - 10https://gerrit.wikimedia.org/r/558279 [02:45:23] (03CR) 10Andrew Bogott: [C: 03+2] Revert "codfw1dev nova: add cloudvirt2002-dev to the scheduling pool" [puppet] - 10https://gerrit.wikimedia.org/r/558279 (owner: 10Andrew Bogott) [03:05:09] (03PS1) 10Andrew Bogott: Revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 [03:13:44] (03PS2) 10Andrew Bogott: Revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 
(https://phabricator.wikimedia.org/T240851) [03:15:24] (03PS3) 10Andrew Bogott: Partially revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 (https://phabricator.wikimedia.org/T240851) [03:35:03] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:43] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:58:03] (03PS1) 10Andrew Bogott: cloud base images: enable passwordless login on serial0 [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) [04:00:16] (03PS2) 10Andrew Bogott: cloud base images: enable passwordless login on serial0 [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) [04:32:07] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteThrough, currently using: WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:42:57] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteThrough policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:52:57] PROBLEM - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [05:06:25] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10RobH) Please 
note support went ahead and already did this, and updated me via task. I have the updated list, and will import into netbox later this week. [05:09:46] bleh, they weren't supposed to do the work outside of hours but did, and now one lost power [05:09:59] well, lost a psu of 2 [05:10:01] it seems [05:18:22] bleh, I'm going to have to make an international call, they haven't updated the ticket. [05:29:23] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10MBq) Looks good now: JobQueue seems emptied. I just tried a couple of renamings which worke... [05:31:18] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10jcrespo) Thanks everybody involved on fixing the issue. @MBq apologies for the problems ca... [05:33:14] ok, it'll go green [05:33:23] it's fixed live on the mgmt, I'm watching it [05:34:47] ok, now that it's cleared (icinga will clear), and they have closed all the racks back up [05:34:51] i can go offline for a bit ;D [05:34:53] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Joe) p:05High→03Normal All jobs have recovered. Not closing the task as we need to redu... 
[05:43:02] (03CR) 10Jhedden: Partially revert "nova.conf ocata: remove [spice] config section" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558284 (https://phabricator.wikimedia.org/T240851) (owner: 10Andrew Bogott) [05:43:08] 10Operations, 10SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (10jcrespo) Just to be clear- there is no problem requesting it, it just seemed odd for WMF work, and that is why I asked. If later y... [05:49:26] (03PS4) 10Andrew Bogott: Partially revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 (https://phabricator.wikimedia.org/T240851) [05:49:38] (03CR) 10Andrew Bogott: [C: 04-1] "this doesn't do what I want it to do" [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) (owner: 10Andrew Bogott) [05:50:28] (03CR) 10Jhedden: [C: 03+1] "+1 per our procedures. I understand the operational benefits of this, but I think a persistent root login is risky." [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) (owner: 10Andrew Bogott) [05:54:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1081 after schema change', diff saved to https://phabricator.wikimedia.org/P9888 and previous config saved to /var/cache/conftool/dbconfig/20191217-055407-marostegui.json [05:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:35] RECOVERY - IPMI Sensor Status on lvs3007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [05:55:17] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm. Do you want to go ahead and include the hiera config for cloudvirt1022 in the same patch as an example usage?" 
[puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [05:56:17] (03PS5) 10Andrew Bogott: Partially revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 (https://phabricator.wikimedia.org/T240851) [05:57:25] (03CR) 10Andrew Bogott: [C: 03+2] Partially revert "nova.conf ocata: remove [spice] config section" [puppet] - 10https://gerrit.wikimedia.org/r/558284 (https://phabricator.wikimedia.org/T240851) (owner: 10Andrew Bogott) [05:57:42] (03PS1) 10Jcrespo: admin: Provide access to kzimmerman (kzeta) to production analytics [puppet] - 10https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) [05:57:54] (03PS1) 10Marostegui: check_depooled: Add new sections managed by dbctl [software] - 10https://gerrit.wikimedia.org/r/558319 [05:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1091 schema change', diff saved to https://phabricator.wikimedia.org/P9889 and previous config saved to /var/cache/conftool/dbconfig/20191217-055848-marostegui.json [05:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:26] (03CR) 10Marostegui: [C: 03+2] check_depooled: Add new sections managed by dbctl [software] - 10https://gerrit.wikimedia.org/r/558319 (owner: 10Marostegui) [06:02:57] (03Merged) 10jenkins-bot: check_depooled: Add new sections managed by dbctl [software] - 10https://gerrit.wikimedia.org/r/558319 (owner: 10Marostegui) [06:06:08] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Pushed" [homer/public] - 10https://gerrit.wikimedia.org/r/557565 (https://phabricator.wikimedia.org/T240456) (owner: 10Ayounsi) [06:06:24] !log Upgrade db1130 T240823 [06:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:29] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [06:07:06] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 
10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10Sotiale) (Rename task) I've tried both large and small cases with accounts to move. It work... [06:11:06] (03CR) 10BryanDavis: cloud: update maintain-views to handle dblists with comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [06:12:32] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:14:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on es2 slaves in eqiad and codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9890 and previous config saved to /var/cache/conftool/dbconfig/20191217-061707-marostegui.json [06:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:14] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [06:20:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on es3 slaves in eqiad and codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9891 and previous config saved to /var/cache/conftool/dbconfig/20191217-061959-marostegui.json [06:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:35] !log remove BGP session to AS32934 in ulsfo - T239896 [06:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:40] T239896: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 [06:21:16] 
10Operations, 10netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (10ayounsi) 05Open→03Resolved a:03ayounsi [06:21:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9892 and previous config saved to /var/cache/conftool/dbconfig/20191217-062136-marostegui.json [06:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:43] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [06:27:27] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [06:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9893 and previous config saved to /var/cache/conftool/dbconfig/20191217-063121-marostegui.json [06:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:28] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [06:35:50] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (10jcrespo) @Sotiale: If the question is if rename processing can continue as usual, the answe... 
[06:36:23] !log volker-e@deploy1001 Started deploy [design/style-guide@73d51f0]: Deploy design/style-guide: [06:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:30] !log volker-e@deploy1001 Finished deploy [design/style-guide@73d51f0]: Deploy design/style-guide: (duration: 00m 07s) [06:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9894 and previous config saved to /var/cache/conftool/dbconfig/20191217-064030-marostegui.json [06:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:37] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [06:47:32] (03PS1) 10Marostegui: db1107: Install mariadb 10.4 [puppet] - 10https://gerrit.wikimedia.org/r/558338 [06:48:01] !log pool maps2002. Postgres init is complete - T239728 [06:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:06] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [06:48:20] (03CR) 10Muehlenhoff: [C: 04-1] "We can have that much simpler: drbd8-utils in stretch is a transition package, i.e. 
a deb which only purpose is to depend on the new name " [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [06:50:38] !log depool maps2003 for postgres init - T239728 [06:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:49] (03PS1) 10Phedenskog: icinga: Add WebPageReplay alerts for ru.wiki [puppet] - 10https://gerrit.wikimedia.org/r/558339 (https://phabricator.wikimedia.org/T198287) [06:51:13] (03CR) 10Marostegui: [C: 03+2] db1107: Install mariadb 10.4 [puppet] - 10https://gerrit.wikimedia.org/r/558338 (owner: 10Marostegui) [06:54:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557847 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [06:54:53] (03Merged) 10jenkins-bot: cxserver: enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557847 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [07:01:21] (03PS1) 10Marostegui: mariadb: Set 10.4 as default on buster [puppet] - 10https://gerrit.wikimedia.org/r/558340 [07:02:08] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1001/20015/db1107.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/558340 (owner: 10Marostegui) [07:02:42] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . 
[07:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:27] (03CR) 10Marostegui: [C: 03+2] mariadb: Set 10.4 as default on buster [puppet] - 10https://gerrit.wikimedia.org/r/558340 (owner: 10Marostegui) [07:07:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1130 T240823', diff saved to https://phabricator.wikimedia.org/P9895 and previous config saved to /var/cache/conftool/dbconfig/20191217-070709-marostegui.json [07:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:17] T240823: db1130 BBU possible issues - https://phabricator.wikimedia.org/T240823 [07:10:23] (03PS1) 10Samwilson: Enable $wgAllowRequiringEmailForResets on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558341 (https://phabricator.wikimedia.org/T240736) [07:11:00] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [07:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:47] (03CR) 10Muehlenhoff: "Looks good in general, two comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) (owner: 10Ema) [07:15:10] !log Upgrade db2100 [07:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:02] !log andrew@deploy1001 Started deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' [07:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:02] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [07:21:48] !log andrew@deploy1001 Finished deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' (duration: 03m 46s) [07:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:58] !log Upgrade candidate masters in codfw db2080 db2103 db2104 db2110 db2113 db2121 db2127 [07:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:39] !log andrew@deploy1001 Started deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' again (with fresh venvs) [07:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:49] !log andrew@deploy1001 Finished deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' again (with fresh venvs) (duration: 00m 10s) [07:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:28] !log andrew@deploy1001 Started deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' again (with fresh venvs) [07:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:13] (03PS1) 10Andrew Bogott: move labweb1001/1002 to horizon 'train' [puppet] - 10https://gerrit.wikimedia.org/r/558348 (https://phabricator.wikimedia.org/T239974) [07:28:07] (03CR) 10Andrew Bogott: [C: 03+2] move labweb1001/1002 to horizon 'train' [puppet] - 
10https://gerrit.wikimedia.org/r/558348 (https://phabricator.wikimedia.org/T239974) (owner: 10Andrew Bogott) [07:29:11] !log andrew@deploy1001 Finished deploy [horizon/deploy@ff67a19]: Updating to Horizon version 'train' again (with fresh venvs) (duration: 03m 43s) [07:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:27] (03PS1) 10Elukey: Fix Analytics /mnt/hdfs readability checks after Kerberos [puppet] - 10https://gerrit.wikimedia.org/r/558349 [07:35:57] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Rdbms, 10Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10jcrespo) I've created https://wikitech.wikimedia.o... [07:39:14] (03PS1) 10Elukey: Add fake kerberos keytabs for stat/notebook hosts [labs/private] - 10https://gerrit.wikimedia.org/r/558351 [07:39:54] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for stat/notebook hosts [labs/private] - 10https://gerrit.wikimedia.org/r/558351 (owner: 10Elukey) [07:41:34] (03PS1) 10Alexandros Kosiaris: k8s: Migrate staging to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558353 (https://phabricator.wikimedia.org/T239835) [07:41:36] (03PS1) 10Alexandros Kosiaris: k8s: Migrate codfw to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) [07:41:38] (03PS1) 10Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) [07:45:18] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20017/" [puppet] - 10https://gerrit.wikimedia.org/r/558349 (owner: 10Elukey) [07:46:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/558349 (owner: 10Elukey) [07:51:14] (03PS1) 10Elukey: 
role::statistics::private: use kerberos with Search jobs [puppet] - 10https://gerrit.wikimedia.org/r/558358 [07:52:07] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [07:52:31] (03CR) 10Elukey: [C: 03+2] role::statistics::private: use kerberos with Search jobs [puppet] - 10https://gerrit.wikimedia.org/r/558358 (owner: 10Elukey) [07:55:43] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:55:54] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [07:56:09] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:01] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:57:54] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: Add WebPageReplay alerts for ru.wiki [puppet] - 10https://gerrit.wikimedia.org/r/558339 (https://phabricator.wikimedia.org/T198287) (owner: 10Phedenskog) [08:05:12] (03PS1) 10Elukey: profile::analytics::cluster::client: refactor 
hdfs mount code [puppet] - 10https://gerrit.wikimedia.org/r/558364 [08:07:40] (03PS2) 10Alexandros Kosiaris: k8s: Migrate staging to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558353 (https://phabricator.wikimedia.org/T239835) [08:07:42] (03PS2) 10Alexandros Kosiaris: k8s: Migrate codfw to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) [08:07:44] (03PS2) 10Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) [08:07:46] (03PS1) 10Alexandros Kosiaris: calico: Parameterize calico datastore type [puppet] - 10https://gerrit.wikimedia.org/r/558365 (https://phabricator.wikimedia.org/T239835) [08:10:49] (03PS1) 10Andrew Bogott: glance policy.json: fix a typo in the manage_image_cache policy [puppet] - 10https://gerrit.wikimedia.org/r/558369 [08:11:54] (03CR) 10Andrew Bogott: [C: 03+2] glance policy.json: fix a typo in the manage_image_cache policy [puppet] - 10https://gerrit.wikimedia.org/r/558369 (owner: 10Andrew Bogott) [08:14:34] (03PS3) 10Muehlenhoff: ganeti: use 'drbd-utils' package [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [08:17:48] (03PS1) 10Giuseppe Lavagetto: cxserver: remove the securityContext from the pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/558373 [08:18:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] calico: Parameterize calico datastore type [puppet] - 10https://gerrit.wikimedia.org/r/558365 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [08:19:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] k8s: Migrate staging to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558353 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [08:22:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: remove the securityContext from the pod 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/558373 (owner: 10Giuseppe Lavagetto) [08:23:08] (03PS1) 10Andrew Bogott: Horizon: use glance API version 2 [puppet] - 10https://gerrit.wikimedia.org/r/558375 [08:25:26] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: use glance API version 2 [puppet] - 10https://gerrit.wikimedia.org/r/558375 (owner: 10Andrew Bogott) [08:26:27] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [08:28:42] (03PS1) 10Marostegui: Revert "dbproxy: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/558377 [08:29:44] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/558377 (owner: 10Marostegui) [08:30:22] (03PS2) 10Alexandros Kosiaris: calico: Parameterize calico datastore type [puppet] - 10https://gerrit.wikimedia.org/r/558365 (https://phabricator.wikimedia.org/T239835) [08:30:24] (03PS3) 10Alexandros Kosiaris: k8s: Migrate staging to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558353 (https://phabricator.wikimedia.org/T239835) [08:30:26] (03PS3) 10Alexandros Kosiaris: k8s: Migrate codfw to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) [08:30:28] (03PS3) 10Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) [08:31:04] !log Repool labsdb1010 T238399 [08:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:11] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 [08:31:29] (03PS4) 10Muehlenhoff: ganeti: use 'drbd-utils' package [puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [08:37:47] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: use 'drbd-utils' package 
[puppet] - 10https://gerrit.wikimedia.org/r/558214 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [08:43:21] (03PS1) 10Jcrespo: bacula: Fix small typos on pool class documentation [puppet] - 10https://gerrit.wikimedia.org/r/558383 [08:44:05] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1238.eqiad.wmnet [08:44:06] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1239.eqiad.wmnet [08:44:07] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1240.eqiad.wmnet [08:44:08] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1241.eqiad.wmnet [08:44:09] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1242.eqiad.wmnet [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:10] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1243.eqiad.wmnet [08:44:12] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1244.eqiad.wmnet [08:44:13] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1245.eqiad.wmnet [08:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:14] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1246.eqiad.wmnet [08:44:15] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1247.eqiad.wmnet [08:44:16] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1248.eqiad.wmnet [08:44:16] !log jiji@cumin1001 conftool action : 
set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1249.eqiad.wmnet [08:44:17] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1250.eqiad.wmnet [08:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:18] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1251.eqiad.wmnet [08:44:20] !log jiji@cumin1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1252.eqiad.wmnet [08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:53] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [08:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:45] 10Operations: https://noc.wikimedia.org/conf/highlight.php returns 404, so all links from noc. are broken. 
- https://phabricator.wikimedia.org/T240928 (10Addshore) [08:47:52] 10Operations, 10Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (10SandraF_WMF) [08:53:08] (03PS5) 10Ayounsi: Add vlan support for asw [homer/public] - 10https://gerrit.wikimedia.org/r/550376 [08:54:49] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add vlan support for asw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/550376 (owner: 10Ayounsi) [08:55:19] (03CR) 10Muehlenhoff: "rbash is unrelated, I suggest you switch to "rush" as an rssh replacement, it's also packaged in Debian." [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [08:56:48] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [08:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:44] mutante: maybe you know https://phabricator.wikimedia.org/T240929 [09:06:08] (03PS2) 10Elukey: profile::analytics::cluster::client: refactor hdfs mount code [puppet] - 10https://gerrit.wikimedia.org/r/558364 [09:06:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] mediawiki::php::admin fix lib.php (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) (owner: 10Effie Mouzeli) [09:08:13] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::client: refactor hdfs mount code [puppet] - 10https://gerrit.wikimedia.org/r/558364 (owner: 10Elukey) [09:13:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 s3 candidate master for upgrade', diff saved to https://phabricator.wikimedia.org/P9896 and previous config saved to /var/cache/conftool/dbconfig/20191217-091316-marostegui.json [09:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:12] !log delete all namespaces in kubernetes staging 
cluster for initialization with etcd3 backing datastore. T239835 [09:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:17] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:17:34] heads up, no services for a while on the kubernetes staging cluster. That may raise a few alerts, but it should be without any impact anywhere [09:20:20] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] rename cr2-knams to cr3, bundle knams-esams links [homer/public] - 10https://gerrit.wikimedia.org/r/552951 (owner: 10Ayounsi) [09:25:01] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [09:30:01] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 154160328 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:30:11] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [09:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:09] 10Operations, 10serviceops: kubestagetcd1003 alerts daily via email to root@ for 'unexpected non snapshot file' - https://phabricator.wikimedia.org/T240932 (10elukey) [09:37:15] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 807752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:42:37] (03CR) 10Hashar: "That was the subject of T225764 . Thank you!" 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/555751 (owner: 10Kosta Harlan) [09:42:54] (03PS1) 10Volans: Updated changelog for first release. [software/homer] - 10https://gerrit.wikimedia.org/r/558435 (https://phabricator.wikimedia.org/T228388) [09:44:05] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 54167776 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075', diff saved to https://phabricator.wikimedia.org/P9899 and previous config saved to /var/cache/conftool/dbconfig/20191217-094418-marostegui.json [09:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:55] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6640 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:48:28] (03CR) 10Volans: [C: 03+2] Updated changelog for first release. [software/homer] - 10https://gerrit.wikimedia.org/r/558435 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [09:48:35] 10Operations, 10cloud-services-team, 10netops: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10fgiunchedi) FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how f... [09:50:59] (03Merged) 10jenkins-bot: Updated changelog for first release. 
[software/homer] - 10https://gerrit.wikimedia.org/r/558435 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [09:51:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075', diff saved to https://phabricator.wikimedia.org/P9900 and previous config saved to /var/cache/conftool/dbconfig/20191217-095152-marostegui.json [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:13] (03PS1) 10Ayounsi: mr: add group stanza [homer/public] - 10https://gerrit.wikimedia.org/r/558438 [09:52:27] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (10Mathew.onipe) @Igorkim78 can you document the config changes here? [09:53:00] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] mr: add group stanza [homer/public] - 10https://gerrit.wikimedia.org/r/558438 (owner: 10Ayounsi) [10:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075', diff saved to https://phabricator.wikimedia.org/P9901 and previous config saved to /var/cache/conftool/dbconfig/20191217-100250-marostegui.json [10:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:36] (03PS1) 10Volans: Release v0.1.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/558442 (https://phabricator.wikimedia.org/T228388) [10:06:23] (03CR) 10Ayounsi: [C: 03+1] Release v0.1.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/558442 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:07:46] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.1.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/558442 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:11:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1075', diff saved to https://phabricator.wikimedia.org/P9902 and previous config saved to 
/var/cache/conftool/dbconfig/20191217-101125-marostegui.json [10:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10hashar) 05Open→03Resolved ` puppet/modules/admin/data(productionu=)$ ./matrix.py... [10:13:12] !log volans@deploy1001 Started deploy [homer/deploy@996f7be]: Homer release v0.1.0 - T228388 [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:18] T228388: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 [10:13:45] !log volans@deploy1001 Finished deploy [homer/deploy@996f7be]: Homer release v0.1.0 - T228388 (duration: 00m 32s) [10:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:07] (03PS1) 10Ayounsi: cr: add routing-instances [homer/public] - 10https://gerrit.wikimedia.org/r/558446 [10:17:21] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] cr: add routing-instances [homer/public] - 10https://gerrit.wikimedia.org/r/558446 (owner: 10Ayounsi) [10:17:28] 10Operations, 10SRE-tools, 10netops, 10Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (10Volans) [10:29:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 - s1 candidate master for upgrade', diff saved to https://phabricator.wikimedia.org/P9905 and previous config saved to /var/cache/conftool/dbconfig/20191217-102907-marostegui.json [10:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:59] (03CR) 10Ema: vhtcpd: convert service to systemd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) (owner: 10Ema) [10:31:48] (03PS4) 10Ema: vhtcpd: convert service to systemd 
[puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) [10:33:46] (03PS1) 10Elukey: Add Spark encryption settings to the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/558453 (https://phabricator.wikimedia.org/T240934) [10:33:56] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [10:35:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) (owner: 10Ema) [10:38:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P9907 and previous config saved to /var/cache/conftool/dbconfig/20191217-103810-marostegui.json [10:38:12] (03PS1) 10Ayounsi: Homer: add Netbox config [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) [10:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:56] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20023/" [puppet] - 10https://gerrit.wikimedia.org/r/558453 (https://phabricator.wikimedia.org/T240934) (owner: 10Elukey) [10:41:25] (03PS1) 10Ayounsi: CR: use FNM recommended values for flow-monitoring [homer/public] - 10https://gerrit.wikimedia.org/r/558457 [10:42:15] (03CR) 10Volans: "nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [10:44:56] (03CR) 10ArielGlenn: [C: 03+1] "Worth trying it but let's somehow remind ourselves to check on this eery 3 months or so to make sure that the ips are the same and/or that" [puppet] - 10https://gerrit.wikimedia.org/r/557137 (owner: 10Bstorm) [10:45:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P9908 and 
previous config saved to /var/cache/conftool/dbconfig/20191217-104530-marostegui.json [10:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P9910 and previous config saved to /var/cache/conftool/dbconfig/20191217-105425-marostegui.json [10:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:38] (03CR) 10Ayounsi: [C: 03+1] homer: add netbox credentials to the configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:56:04] jouncebot: reload [10:56:10] jouncebot: refresh [10:56:11] I refreshed my knowledge about deployments. [10:56:15] jouncebot: next [10:56:15] In 1 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1200) [10:56:15] In 1 hour(s) and 3 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1200) [10:56:28] (03Abandoned) 10Volans: homer: add netbox credentials to the configuration [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:56:41] Meh, oh well. kart_: Don't suppose you're around now if I push your config patch? [10:56:52] James_F: yes. [10:56:57] (03PS2) 10Ayounsi: Homer: add Netbox config [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) [10:56:59] Let's do it now, then. 
[10:57:19] (03PS11) 10Jforrester: ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [10:57:28] (03CR) 10Jforrester: [C: 03+2] ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [10:57:40] I keep forgetting that SWAT is one hour after, but I'm ready anyway :D [10:57:57] (03CR) 10Volans: [C: 03+1] "LGTM, just check that the compiler is happy" [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [10:58:00] jouncebot: refresh [10:58:00] I refreshed my knowledge about deployments. [10:58:02] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/20025/cumin1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [10:58:05] jouncebot: next [10:58:05] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1100) [10:58:14] There, now it's not one hour later. ;-) [10:58:25] (03Merged) 10jenkins-bot: ContentTranslation: Make available by default for new Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry) [10:58:46] (03CR) 10Ayounsi: [C: 03+2] Homer: add Netbox config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558456 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [10:58:50] James_F: haha. [10:59:13] kart_: Live on mwdebug1001; look OK to you? 
[10:59:37] (03PS2) 10Effie Mouzeli: mediawiki::php::admin memory optimisation for lib.php [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1100). Please do the needful. [11:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:31] James_F: give me a minute to test. [11:02:45] James_F: looks OK. Go ahead. [11:03:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1134', diff saved to https://phabricator.wikimedia.org/P9911 and previous config saved to /var/cache/conftool/dbconfig/20191217-110322-marostegui.json [11:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:16] OK. [11:05:18] (03PS1) 10Ayounsi: Add homer user and public key to all devices [homer/public] - 10https://gerrit.wikimedia.org/r/558465 [11:05:35] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T234318 ContentTranslation: Make available by default for new Wikipedias (duration: 00m 58s) [11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:41] T234318: Enable Content translation as a default tool (out of beta) for newly created Wikipedias - https://phabricator.wikimedia.org/T234318 [11:06:20] kart_: All done? 
[11:06:24] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@665d9d3]: (no justification provided) [11:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:50] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@665d9d3]: (no justification provided) (duration: 00m 25s) [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:04] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [11:07:22] James_F: yeah. Thanks a lot! [11:08:38] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add homer user and public key to all devices [homer/public] - 10https://gerrit.wikimedia.org/r/558465 (owner: 10Ayounsi) [11:08:55] Excellent. SWAT done. [11:08:59] :D [11:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1091', diff saved to https://phabricator.wikimedia.org/P9912 and previous config saved to /var/cache/conftool/dbconfig/20191217-111400-marostegui.json [11:14:02] (03CR) 10Ema: [C: 03+2] vhtcpd: convert service to systemd [puppet] - 10https://gerrit.wikimedia.org/r/558138 (https://phabricator.wikimedia.org/T240826) (owner: 10Ema) [11:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:54] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [11:20:28] (03PS1) 10Alexandros Kosiaris: Switch staging calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558471 (https://phabricator.wikimedia.org/T239835) [11:20:30] (03PS1) 10Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835) [11:20:32] (03PS1) 10Alexandros 
Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) [11:21:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [11:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1091', diff saved to https://phabricator.wikimedia.org/P9913 and previous config saved to /var/cache/conftool/dbconfig/20191217-112134-marostegui.json [11:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch staging calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558471 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:21:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch staging calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558471 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:22:12] (03Merged) 10jenkins-bot: Switch staging calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558471 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:25:50] (03PS2) 10Arturo Borrero Gonzalez: dynamicproxy: add backend information to access log entries [puppet] - 10https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) [11:27:39] (03PS1) 10Ema: vhtcpd: pass -r only once in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/558474 (https://phabricator.wikimedia.org/T240826) [11:27:47] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=1 # T237984 [11:27:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:53] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [11:28:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Parameterize calico datastore type [puppet] - 10https://gerrit.wikimedia.org/r/558365 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:29:22] (03PS1) 10WMDE-leszek: Phragile: Added PHP extensions needed by PHP 7 dependencies [puppet] - 10https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) [11:30:07] (03PS2) 10Ema: vhtcpd: pass -r only once in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/558474 (https://phabricator.wikimedia.org/T240826) [11:30:17] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] CR: use FNM recommended values for flow-monitoring [homer/public] - 10https://gerrit.wikimedia.org/r/558457 (owner: 10Ayounsi) [11:30:19] (03CR) 10Volans: [C: 03+1] "LGTM! I've tested extensively the UI and looks great! Thanks a lot of all the effort. One minor thing inline." (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [11:30:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Migrate staging to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558353 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:33:28] (03PS3) 10Ema: vhtcpd: pass -r only once in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/558474 (https://phabricator.wikimedia.org/T240826) [11:33:30] (03CR) 10WMDE-leszek: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) (owner: 10WMDE-leszek) [11:34:23] !log finished my last maint script run early after https://phabricator.wikimedia.org/P9914 [11:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:23] (03CR) 10Ema: [C: 03+2] vhtcpd: pass -r only once in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/558474 (https://phabricator.wikimedia.org/T240826) (owner: 10Ema) [11:36:35] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (10Peachey88) [11:37:29] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [11:37:29] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [11:37:29] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[11:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:34] (03PS1) 10Ayounsi: Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1200) [12:00:19] (03CR) 10jerkins-bot: [V: 04-1] Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:00:27] (03CR) 10Volans: "Strange diff inline" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:01:12] (03CR) 10WMDE-Fisch: [C: 03+1] Phragile: Added PHP extensions needed by PHP 7 dependencies [puppet] - 10https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) (owner: 10WMDE-leszek) [12:07:18] (03PS12) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [12:07:26] (03CR) 10Muehlenhoff: Add image tracking support (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [12:09:44] (03PS2) 10Ayounsi: Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) [12:10:09] (03CR) 10Volans: [C: 03+1] "LGTM, ship it!" 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [12:11:48] 10Operations, 10netops, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10aborrero) >>! In T240789#5747102, @fgiunchedi wrote: > FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS range... [12:12:31] (03CR) 10jerkins-bot: [V: 04-1] Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:21:34] (03PS3) 10Ayounsi: Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) [12:24:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:25:43] (03CR) 10Ayounsi: [C: 03+2] Make transport username configurable [software/homer] - 10https://gerrit.wikimedia.org/r/558479 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:27:41] (03CR) 10Muehlenhoff: [C: 03+2] Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [12:30:01] (03PS1) 10Ayounsi: Updated changelog for second release. 
[software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) [12:30:40] 10Operations, 10Puppet, 10User-jbond: Clean up SSL configueration - https://phabricator.wikimedia.org/T240941 (10jbond) [12:30:42] 10Operations, 10Puppet, 10User-jbond: Clean up SSL configueration - https://phabricator.wikimedia.org/T240941 (10jbond) [12:31:39] (03PS3) 10Effie Mouzeli: mediawiki::php::admin memory optimisation for lib.php [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) [12:31:42] (03CR) 10Volans: [C: 04-1] "Small nits" (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:37:14] (03PS2) 10Ayounsi: Updated changelog for v0.1.1. [software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) [12:38:02] (03CR) 10Ayounsi: Updated changelog for v0.1.1. (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:40:14] (03PS1) 10Jforrester: Group0 to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558494 [12:40:37] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:40:49] (03PS3) 10Ayounsi: Updated changelog for v0.1.1. [software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) [12:41:16] !log Running scap clean for wmf.8 T233859 [12:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [12:43:13] tarrow: Everything from Wikibase landed? [12:43:42] aye; it was just that single revert [12:43:48] Cool. [12:43:58] (03CR) 10Ayounsi: [C: 03+2] Updated changelog for v0.1.1. 
[software/homer] - 10https://gerrit.wikimedia.org/r/558491 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [12:45:33] (03CR) 10Phamhi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:46:55] (03PS1) 10Jbond: logstash: remove base::expose_puppet_certs. [puppet] - 10https://gerrit.wikimedia.org/r/558496 [12:47:20] (03PS2) 10Jbond: logstash: remove base::expose_puppet_certs. [puppet] - 10https://gerrit.wikimedia.org/r/558496 (https://phabricator.wikimedia.org/T240941) [12:50:55] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.8 (duration: 09m 25s) [12:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:36] (03PS1) 10Ayounsi: Homer: specify username to connect to devices [puppet] - 10https://gerrit.wikimedia.org/r/558497 (https://phabricator.wikimedia.org/T228388) [12:57:55] (03PS1) 10Ayounsi: Release v0.1.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/558499 [12:58:00] !log jforrester@deploy1001 Started scap: testwiki to php-1.35.0-wmf.11 and rebuild l10n cache T233859 [12:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:06] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [12:58:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "OK, we can evaluate other shells later." 
[puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:59:01] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Release v0.1.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/558499 (owner: 10Ayounsi) [12:59:33] (03CR) 10Phamhi: [C: 03+2] wmcs: monitoring: remove rssh [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:59:44] (03PS4) 10Phamhi: wmcs: monitoring: remove rssh [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) [13:00:04] James_F and longma: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - Europeam Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1300). [13:00:09] * James_F grins. [13:00:15] (Full scap on-going.) [13:01:20] !log ayounsi@deploy1001 Started deploy [homer/deploy@359de04]: Homer release v0.1.1 - T228388 [13:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:26] T228388: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 [13:01:44] !log ayounsi@deploy1001 Finished deploy [homer/deploy@359de04]: Homer release v0.1.1 - T228388 (duration: 00m 30s) [13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:25] !log installing intel-microcode updates to 3.20191115.1 on buster/stretch (with one Xeon type rolled back to a regression causing failing reboots, see DSA 4562-2) [13:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:39] (03CR) 10Ayounsi: [C: 03+2] Homer: specify username to connect to devices [puppet] - 10https://gerrit.wikimedia.org/r/558497 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [13:05:39] (03PS1) 10Jbond: nginx::simple_tlsproxy: remove class [puppet] - 10https://gerrit.wikimedia.org/r/558501 
(https://phabricator.wikimedia.org/T240941) [13:08:08] (03CR) 10Effie Mouzeli: mediawiki::php::admin memory optimisation for lib.php (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) (owner: 10Effie Mouzeli) [13:08:57] (03PS2) 10Jbond: nginx::simple_tlsproxy: remove class [puppet] - 10https://gerrit.wikimedia.org/r/558501 (https://phabricator.wikimedia.org/T240941) [13:08:59] (03PS4) 10Effie Mouzeli: mediawiki::php::admin memory optimisation for lib.php [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) [13:13:10] (03CR) 10Vgutierrez: [C: 03+1] nginx::simple_tlsproxy: remove class [puppet] - 10https://gerrit.wikimedia.org/r/558501 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:14:39] (03PS1) 10Ayounsi: Homer: set SSH_AUTH_SOCK [puppet] - 10https://gerrit.wikimedia.org/r/558504 (https://phabricator.wikimedia.org/T228388) [13:14:48] (03PS2) 10Ema: ATS: stop logging BereqURL at the TLS layer too [puppet] - 10https://gerrit.wikimedia.org/r/556142 (https://phabricator.wikimedia.org/T237608) [13:17:08] (03CR) 10Ema: [C: 03+2] ATS: stop logging BereqURL at the TLS layer too [puppet] - 10https://gerrit.wikimedia.org/r/556142 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [13:17:24] (03PS1) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/558506 (https://phabricator.wikimedia.org/T224585) [13:21:00] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) It looks like cloudmetrics1002 IP 10.64.4.15 is already in the ACL. We will try https://gerrit.wikimedia.org/r/c/operat... 
[13:21:37] Operations, Traffic, Patch-For-Review: ATS skipping certain logs due to lack of buffer space - https://phabricator.wikimedia.org/T237608 (ema) Open→Resolved a:ema We have bumped buffer sizes, decreased the amount of information being logged, and added icinga checks alerting if logs are sk... [13:22:08] (CR) Phamhi: "Network confirmed that cloudmetrics1002 IP is already in the network ACLs" [puppet] - https://gerrit.wikimedia.org/r/558506 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:24:09] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5006 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:24:51] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3059 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:24:58] Operations, Traffic: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (ema) Open→Resolved a:ema Now that systemd properly takes care of supervising `vhtcpd.service`, the service is automatically restarted upon SIGSEGV: ` Dec 17 13:13:35 cp3050 systemd... 
[13:25:03] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3065 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:25:29] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3053 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:25:39] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3055 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:25:41] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3050 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:25:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1091', diff saved to https://phabricator.wikimedia.org/P9915 and previous config saved to /var/cache/conftool/dbconfig/20191217-132554-marostegui.json [13:25:56] looking ^ [13:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:21] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5001 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:25] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp1085 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:31] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp1075 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:45] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3064 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - 
TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:55] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:59] strange, atslog-tls does work as expected [13:28:15] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5009 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:28:21] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5011 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:28:49] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3056 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:29:01] Operations, DBA, Growth-Team, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Marostegui) >>! In T107610#5726992, @Marostegui wrote: > The new hosts for es4 and es5 have been ordered and wil... 
[13:29:47] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp1083 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:34:46] !log jforrester@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.11 and rebuild l10n cache T233859 (duration: 36m 46s) [13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:51] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [13:34:56] (PS2) Gehel: search: decommission elastic10[18-31].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/554517 (https://phabricator.wikimedia.org/T239821) [13:37:14] (CR) Jforrester: [C: +2] Group0 to 1.35.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/558494 (owner: Jforrester) [13:38:07] (Merged) jenkins-bot: Group0 to 1.35.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/558494 (owner: Jforrester) [13:38:14] Operations, Puppet, Patch-For-Review, User-jbond: Clean up SSL configueration - https://phabricator.wikimedia.org/T240941 (jbond) [13:39:55] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.11 T233859 [13:40:00] (CR) Arturo Borrero Gonzalez: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:01] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [13:40:23] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs: make cloudmetrics1002 the primary instead of labmon1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558506 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:40:35] Train provisionally looks OK in group0. 
[13:41:02] (Jinxing myself, of course.) [13:42:39] (CR) Gehel: [C: +2] search: decommission elastic10[18-31].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/554517 (https://phabricator.wikimedia.org/T239821) (owner: Gehel) [13:48:09] (CR) Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558506 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:48:26] (CR) Volans: [C: +1] "ship it" [puppet] - https://gerrit.wikimedia.org/r/558504 (https://phabricator.wikimedia.org/T228388) (owner: Ayounsi) [13:48:27] was there a problem with memcaching this morning? [13:48:36] !log gehel@cumin1001 START - Cookbook sre.hosts.decommission [13:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:27] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:33] Operations, DC-Ops, decommission, Discovery-Search (Current work), Patch-For-Review: decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: `elastic1018.eqiad.wmnet` -... 
[13:51:05] (CR) Ayounsi: [C: +2] Homer: set SSH_AUTH_SOCK [puppet] - https://gerrit.wikimedia.org/r/558504 (https://phabricator.wikimedia.org/T228388) (owner: Ayounsi) [13:51:16] !log gehel@cumin1001 START - Cookbook sre.hosts.decommission [13:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] (PS1) Muehlenhoff: Bump CLI client version [software/debmonitor] - https://gerrit.wikimedia.org/r/558516 [13:53:27] (CR) Phamhi: [C: +2] wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - https://gerrit.wikimedia.org/r/558506 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:53:46] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:55] Operations, DC-Ops, decommission, Discovery-Search (Current work), Patch-For-Review: decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by gehel@cumin1001 for hosts: `elastic[1019-1020,1022-1031]... 
[13:56:19] (PS2) Muehlenhoff: Bump CLI client version [software/debmonitor] - https://gerrit.wikimedia.org/r/558516 [13:56:49] (CR) Volans: [C: +1] "Ship it" [software/debmonitor] - https://gerrit.wikimedia.org/r/558516 (owner: Muehlenhoff) [13:58:52] (CR) Muehlenhoff: [C: +2] Bump CLI client version [software/debmonitor] - https://gerrit.wikimedia.org/r/558516 (owner: Muehlenhoff) [14:01:42] (PS1) Gehel: elasticsearch: decommission elastic10[18-31] [puppet] - https://gerrit.wikimedia.org/r/558521 (https://phabricator.wikimedia.org/T239821) [14:02:15] !log andrew@deploy1001 Started deploy [horizon/deploy@b95d700]: Deploying a fix to the puppet prefix tab [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:44] (PS1) BBlack: dotls: use haproxy exporter profile [puppet] - https://gerrit.wikimedia.org/r/558522 (https://phabricator.wikimedia.org/T239994) [14:03:49] (CR) DCausse: [C: +1] elasticsearch: decommission elastic10[18-31] [puppet] - https://gerrit.wikimedia.org/r/558521 (https://phabricator.wikimedia.org/T239821) (owner: Gehel) [14:04:23] Operations, Traffic: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema) [14:04:35] Operations, Traffic: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema) p:Triage→Normal [14:05:40] !log andrew@deploy1001 Finished deploy [horizon/deploy@b95d700]: Deploying a fix to the puppet prefix tab (duration: 03m 26s) [14:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:12] (PS2) Gehel: elasticsearch: decommission elastic10[18-31] [puppet] - https://gerrit.wikimedia.org/r/558521 (https://phabricator.wikimedia.org/T239821) [14:10:14] (CR) Gehel: [C: +2] elasticsearch: decommission elastic10[18-31] [puppet] - 
https://gerrit.wikimedia.org/r/558521 (https://phabricator.wikimedia.org/T239821) (owner: Gehel) [14:10:46] !log Upgrade db2091 [14:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:23] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3050 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:03] Operations, Traffic: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema) > I suspect that reloading ats would break logging at this point. That's not the case, reload works perfectly fine. Changes to the log format are reflected c... [14:14:43] (PS1) Arturo Borrero Gonzalez: toolforge: bastion: raise default value for nproc [puppet] - https://gerrit.wikimedia.org/r/558523 (https://phabricator.wikimedia.org/T240925) [14:15:58] !log cp: rolling ats-tls-restart to clear issues caused by T240950 [14:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:04] T240950: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 [14:16:50] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: bastion: raise default value for nproc [puppet] - https://gerrit.wikimedia.org/r/558523 (https://phabricator.wikimedia.org/T240925) (owner: Arturo Borrero Gonzalez) [14:18:29] (PS1) Gehel: elasticsearch: decommission elastic[1018-1031] [dns] - https://gerrit.wikimedia.org/r/558525 (https://phabricator.wikimedia.org/T239821) [14:18:59] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp1075 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:19:17] (PS1) DCausse: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 
https://gerrit.wikimedia.org/r/558526 (https://phabricator.wikimedia.org/T238045) [14:22:27] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [14:23:07] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/558496 (https://phabricator.wikimedia.org/T240941) (owner: Jbond) [14:25:16] (CR) Jcrespo: "I was supposed to do actual work on this, but got interrupted. Will send further changes later." [puppet] - https://gerrit.wikimedia.org/r/558383 (owner: Jcrespo) [14:29:45] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:29:59] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3059 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:32:31] <_joe_> dcausse, gehel can you take a look at that elasticsearch alert? [14:32:50] looking [14:32:52] _joe_: sure, most probably related to the decom of old nodes [14:33:10] master should have already switched way before, checking [14:35:27] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3062 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:35:29] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Mholloway) Yes, when @Pchelolo is back I'll work with him to figure out what exactly went w... 
[14:36:43] (PS1) Muehlenhoff: Release v0.2.0 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/558535 [14:36:54] !log restarting elastic1054 for config change (new master) [14:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:53] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3053 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:38:33] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5009 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:39:05] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:40:47] !log tearing down tmux sessions generating nic_saturation kludge on some memcached hosts [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:05] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp1085 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:41:25] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5006 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:41:53] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5001 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:42:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[14:42:08] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:29] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3065 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:45:54] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:45:57] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:00] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:31] (CR) Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1003/20030/cp1075.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [14:47:09] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (jcrespo) Thank you, Mholloway, your help here was highly appreciated! 
[14:47:28] (CR) Vgutierrez: [C: +1] ATS: separate wikidata sessions from others [puppet] - https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [14:47:29] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5011 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:17] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp1083 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:06] (PS2) Gehel: elasticsearch: decommission elastic[1018-1031] [dns] - https://gerrit.wikimedia.org/r/558525 (https://phabricator.wikimedia.org/T239821) [14:49:32] (PS1) Jforrester: [BETA] Enable real-time VisualEditor test on Beta Deployment Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/558540 [14:49:49] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3064 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:50:18] (CR) jerkins-bot: [V: -1] [BETA] Enable real-time VisualEditor test on Beta Deployment Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/558540 (owner: Jforrester) [14:50:43] (PS2) Jforrester: [BETA] Enable real-time VisualEditor test on Beta Deployment Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/558540 [14:52:25] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3055 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:56:48] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/558525 (https://phabricator.wikimedia.org/T239821) (owner: Gehel) [14:57:07] (CR) Jforrester: [C: +2] [BETA] 
Enable real-time VisualEditor test on Beta Deployment Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/558540 (owner: Jforrester) [14:57:09] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3056 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:57:59] (Merged) jenkins-bot: [BETA] Enable real-time VisualEditor test on Beta Deployment Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/558540 (owner: Jforrester) [15:05:29] (PS5) Ema: ATS: separate wikidata sessions from others [puppet] - https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) [15:09:41] (CR) Ema: [C: +2] ATS: separate wikidata sessions from others [puppet] - https://gerrit.wikimedia.org/r/558048 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [15:12:57] jouncebot: now [15:12:58] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [15:13:09] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=10 --from-id=3929876 # T237984 [15:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:15] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [15:13:55] (CR) BBlack: [C: +2] dotls: use haproxy exporter profile [puppet] - https://gerrit.wikimedia.org/r/558522 (https://phabricator.wikimedia.org/T239994) (owner: BBlack) [15:14:25] !log pool maps2003 after postgres init - T239728 [15:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:31] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [15:14:32] (PS2) Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835) [15:14:34] (PS2) Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) [15:14:36] (PS1) Alexandros Kosiaris: admin: Set dnsPolicy: Default for the calico controller [deployment-charts] - https://gerrit.wikimedia.org/r/558546 (https://phabricator.wikimedia.org/T239835) [15:15:32] (CR) Alexandros Kosiaris: [C: +2] admin: Set dnsPolicy: Default for the calico controller [deployment-charts] - https://gerrit.wikimedia.org/r/558546 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris) [15:15:50] (Merged) jenkins-bot: admin: Set dnsPolicy: Default for the calico controller [deployment-charts] - https://gerrit.wikimedia.org/r/558546 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris) [15:17:26] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=10 --from-id=3929876 # T237984 (I stopped it at Processed up to page 14546856 (P501)) [15:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] (PS4) Alexandros Kosiaris: k8s: Migrate codfw to the new etcd cluster [puppet] - https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) [15:20:26] (PS4) Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) [15:20:28] (PS1) Alexandros Kosiaris: calico::cni: Pass datastore_type as well [puppet] - https://gerrit.wikimedia.org/r/558547 (https://phabricator.wikimedia.org/T239835) [15:24:31] !log 
akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:38:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:41:53] (CR) Muehlenhoff: [V: +2 C: +2] Release v0.2.0 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/558535 (owner: Muehlenhoff) [15:44:54] (CR) Ema: [C: +2] ATS: disable compress plugin in text@ulsfo [puppet] - https://gerrit.wikimedia.org/r/558089 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [15:50:12] (PS2) Bstorm: dumpsdistribution: get around unreliable DNS with an IP hardcode [puppet] - https://gerrit.wikimedia.org/r/557137 [15:50:51] !log jmm@deploy1001 Started deploy [debmonitor/deploy@bb99a23]: Debmonitor release v0.2.0 - T237978 [15:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:57] T237978: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 [15:51:37] (CR) Bstorm: [C: +2] dumpsdistribution: get around unreliable DNS with an IP hardcode [puppet] - https://gerrit.wikimedia.org/r/557137 (owner: Bstorm) [15:52:11] (PS1) Elukey: Enable Spark RPC encryption for the Yarn shuffler [puppet] - https://gerrit.wikimedia.org/r/558563 (https://phabricator.wikimedia.org/T240934) [15:52:27] (CR) Elukey: [C: +2] Enable Spark RPC encryption for the Yarn shuffler [puppet] - https://gerrit.wikimedia.org/r/558563 
(https://phabricator.wikimedia.org/T240934) (owner: Elukey) [15:52:51] !log text@ulsfo rolling ats-backend-restart to disable compress plugin https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/558089/ T238495 [15:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:57] T238495: Late last visual change caused by rotating banner - https://phabricator.wikimedia.org/T238495 [15:53:02] (PS2) Elukey: Enable Spark RPC encryption for the Yarn shuffler in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/558563 (https://phabricator.wikimedia.org/T240934) [15:54:26] (PS1) Bstorm: Revert "dumpsdistribution: get around unreliable DNS with an IP hardcode" [puppet] - https://gerrit.wikimedia.org/r/558564 [15:55:44] (CR) Bstorm: "Error: /Stage[main]/Ferm/Service[ferm]: Systemd restart for ferm failed!" [puppet] - https://gerrit.wikimedia.org/r/558564 (owner: Bstorm) [15:56:01] (CR) Bstorm: [C: +2] Revert "dumpsdistribution: get around unreliable DNS with an IP hardcode" [puppet] - https://gerrit.wikimedia.org/r/558564 (owner: Bstorm) [15:57:17] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@bb99a23]: Debmonitor release v0.2.0 - T237978 (duration: 06m 25s) [15:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:22] T237978: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 [15:57:33] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:19] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:19] (PS3) Bstorm: dumps distribution: increase the rate limit to 5MBps [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) [16:07:23] (CR) Bstorm: "Going to go ahead and try this to see if there is any impact at all, just to keep moving on this." [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) (owner: Bstorm) [16:07:24] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.11/includes/specials/pagers/NewPagesPager.php: T240924 NewPagesPager: Fix namespace query conditions (duration: 01m 02s) [16:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] T240924: Special:NewPages - Exception caught inside exception handler when filtering using associated namespace - https://phabricator.wikimedia.org/T240924 [16:07:37] (CR) Bstorm: [C: +2] dumps distribution: increase the rate limit to 5MBps [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) (owner: Bstorm) [16:12:05] (PS1) Muehlenhoff: Bump changelog for 0.2.0-1 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/558575 [16:13:10] (CR) Volans: [C: +1] "LGTM" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/558575 (owner: Muehlenhoff) [16:14:10] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.10/includes/specials/pagers/NewPagesPager.php: T240924 NewPagesPager: Fix namespace query conditions (duration: 01m 03s) [16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:16] T240924: Special:NewPages - Exception caught inside exception handler when filtering using associated namespace - 
https://phabricator.wikimedia.org/T240924 [16:16:04] (CR) Muehlenhoff: [C: +2] Bump changelog for 0.2.0-1 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/558575 (owner: Muehlenhoff) [16:17:10] (PS1) DCausse: [cirrus] move similarity settings to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558576 [16:17:12] (PS1) DCausse: [WIP] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [16:17:55] (CR) jerkins-bot: [V: -1] [cirrus] move similarity settings to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558576 (owner: DCausse) [16:18:14] (CR) jerkins-bot: [V: -1] [WIP] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: DCausse) [16:23:26] (PS1) Ottomata: Install libsasl2-modules-gssapi-mit on jupyter notebook servers [puppet] - https://gerrit.wikimedia.org/r/558579 [16:24:07] (CR) Ottomata: [C: +2] Install libsasl2-modules-gssapi-mit on jupyter notebook servers [puppet] - https://gerrit.wikimedia.org/r/558579 (owner: Ottomata) [16:30:26] (PS3) Jbond: stunnel: add stunnel module (WIP) [puppet] - https://gerrit.wikimedia.org/r/558133 [16:32:22] (CR) jerkins-bot: [V: -1] stunnel: add stunnel module (WIP) [puppet] - https://gerrit.wikimedia.org/r/558133 (owner: Jbond) [16:35:09] (Abandoned) Cwhite: proton: enable statsd_exporter and add matching rules to profile::proton [puppet] - https://gerrit.wikimedia.org/r/480259 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [16:38:17] (PS1) Andrew Bogott: Add andrewbogott to the 'gratitude' contact list [puppet] - https://gerrit.wikimedia.org/r/558587 (https://phabricator.wikimedia.org/T238424) [16:38:19] (PS1) Andrew Bogott: Add contact for the 'gratitude' project that uses 
sms-via-email [puppet] - 10https://gerrit.wikimedia.org/r/558588 (https://phabricator.wikimedia.org/T238424) [16:39:14] hm, self-ping [16:39:25] (03PS4) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [16:39:27] (03PS1) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [16:39:39] (03CR) 10Andrew Bogott: [C: 03+2] Add andrewbogott to the 'gratitude' contact list [puppet] - 10https://gerrit.wikimedia.org/r/558587 (https://phabricator.wikimedia.org/T238424) (owner: 10Andrew Bogott) [16:41:33] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [16:47:55] (03PS1) 10Muehlenhoff: Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) [16:48:58] (03CR) 10jerkins-bot: [V: 04-1] Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:49:49] !log Deploy schema change on commonswiki.logging on db1138 (s4 primary master) - T233135 [16:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:55] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [16:51:10] (03PS2) 10Muehlenhoff: Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) [16:52:18] (03PS5) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [16:53:05] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [16:55:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/558501 
(https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [16:56:23] (03PS6) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [16:58:36] (03CR) 10BryanDavis: [C: 03+1] nginx::simple_tlsproxy: remove class [puppet] - 10https://gerrit.wikimedia.org/r/558501 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [17:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1700). [17:00:04] tgr and twentyafterfour: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:30] (03CR) 10Jbond: [C: 03+2] nginx::simple_tlsproxy: remove class [puppet] - 10https://gerrit.wikimedia.org/r/558501 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [17:00:33] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:49] <_joe_> tgr|away: oh sigh we forgot to merge your patch last week, eek [17:01:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - 10https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [17:01:51] (03PS2) 10Andrew Bogott: Add contact for the 'gratitude' project that uses sms-via-email [puppet] - 10https://gerrit.wikimedia.org/r/558588 (https://phabricator.wikimedia.org/T238424) [17:03:16] (03CR) 10Andrew Bogott: [C: 03+2] Add contact for the 'gratitude' project that uses sms-via-email [puppet] - 10https://gerrit.wikimedia.org/r/558588 (https://phabricator.wikimedia.org/T238424) (owner: 10Andrew Bogott) [17:04:15] (03PS7) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:04:53] 
(03PS1) 10Elukey: Add Spark encryption settings to the Hadoop test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/558601 (https://phabricator.wikimedia.org/T240934) [17:06:10] !log uploading debmonitor-client 0.2.0 to apt.wikimedia.org (jessie/stretch/buster) T237978 [17:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:16] T237978: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 [17:07:10] <_joe_> twentyafterfour: I will have to amend tgr's patch [17:07:32] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thanks for looking into this! See inline on standard.cfg usage" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [17:07:54] (03CR) 10Elukey: [C: 03+2] Add Spark encryption settings to the Hadoop test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/558601 (https://phabricator.wikimedia.org/T240934) (owner: 10Elukey) [17:09:29] (03PS1) 10Giuseppe Lavagetto: growthexperiments: fix interval declaration [puppet] - 10https://gerrit.wikimedia.org/r/558603 [17:09:45] (03PS3) 10Muehlenhoff: Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) [17:09:52] (03CR) 10Muehlenhoff: Switch cloudmetrics to the new unified partitioning scheme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [17:10:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [17:17:53] (03PS8) 10Jbond: stunnel: add stunnel module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:20:23] (03PS2) 10Giuseppe Lavagetto: growthexperiments: fix interval declaration [puppet] - 10https://gerrit.wikimedia.org/r/558603 [17:23:43] (03PS9) 10Jbond: 
stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:23:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] growthexperiments: fix interval declaration [puppet] - 10https://gerrit.wikimedia.org/r/558603 (owner: 10Giuseppe Lavagetto) [17:24:39] (03PS1) 10Andrew Bogott: rename gratitude-team to gratitudeteam [puppet] - 10https://gerrit.wikimedia.org/r/558609 (https://phabricator.wikimedia.org/T238424) [17:25:56] (03CR) 10jerkins-bot: [V: 04-1] stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [17:26:00] (03CR) 10Andrew Bogott: [C: 03+2] rename gratitude-team to gratitudeteam [puppet] - 10https://gerrit.wikimedia.org/r/558609 (https://phabricator.wikimedia.org/T238424) (owner: 10Andrew Bogott) [17:29:41] (03PS2) 10Jbond: apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 [17:30:09] (03PS10) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 [17:31:30] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: update to use stunnle client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond) [17:33:45] (03CR) 10Volans: [C: 04-1] "Some comments inline. Do you have a compiler result at hand?" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:41:35] (03CR) 10Jbond: [C: 03+2] logstash: remove base::expose_puppet_certs. [puppet] - 10https://gerrit.wikimedia.org/r/558496 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [17:53:23] (03PS1) 10Giuseppe Lavagetto: [WiP] Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [17:54:18] (03PS1) 10Krinkle: Disable wgLegacyJavaScriptGlobals on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558621 (https://phabricator.wikimedia.org/T72470) [17:54:26] (03CR) 10CRusnov: netbox: Add automation git machinery (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:55:26] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191217T1800). [18:05:37] (03CR) 10Bstorm: [C: 03+1] "If this is what is running effectively in toolsbeta, I'm good with it." [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [18:06:20] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@c30d801]: Update mobileapps to 5551575 [18:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:15] (03PS2) 10DCausse: [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 [18:08:17] (03PS2) 10DCausse: [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [18:09:19] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: 10DCausse) [18:09:23] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] move similarity settings to IS.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/558576 (owner: 10DCausse) [18:12:58] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@c30d801]: Update mobileapps to 5551575 (duration: 06m 38s) [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:20:11] (03PS2) 10WMDE-leszek: Phragile: Added PHP extensions needed by PHP 7 dependencies [puppet] - 10https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) [18:29:25] (03PS3) 10DCausse: [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 [18:29:27] (03PS3) 10DCausse: [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [18:30:24] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: 10DCausse) [18:31:54] (03CR) 10Volans: [C: 04-1] "replies inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [18:33:55] (03PS1) 10Herron: dns: add ganeti01.svc.ulsfo.wmnet cluster service address [dns] - 10https://gerrit.wikimedia.org/r/558633 (https://phabricator.wikimedia.org/T226444) [18:35:33] (03CR) 10Herron: [C: 03+2] dns: add ganeti01.svc.ulsfo.wmnet cluster service address [dns] - 10https://gerrit.wikimedia.org/r/558633 
(https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [18:35:58] (03PS4) 10DCausse: [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [18:36:47] (03PS1) 10Eric Gardner: Disable MachineVision email notifications on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) [18:37:08] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: 10DCausse) [18:38:24] (03PS1) 10Jhedden: ceph: add secondary interface for cloudceph servers [dns] - 10https://gerrit.wikimedia.org/r/558636 (https://phabricator.wikimedia.org/T240965) [18:39:03] (03CR) 10Bstorm: [C: 03+1] "Looking through, I see the defaults effectively configure tools correctly. That said, there is one piece of this that is "untested" and th" [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [18:40:52] (03PS5) 10DCausse: [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [18:42:06] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: 10DCausse) [18:43:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:45:14] 
(03PS6) 10DCausse: [cirrus] add elastic mapping for ores drafttopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) [18:48:02] (03PS12) 10Bstorm: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [18:52:29] (03CR) 10Bstorm: [C: 03+2] toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [18:58:13] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/MachineVision: Add more info to MachineVisionEntitySaveException message (duration: 01m 07s) [18:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:54] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Add more info to MachineVisionEntitySaveException message (duration: 01m 08s) [19:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:09:42] (03CR) 10CRusnov: netbox: Add automation git machinery (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [19:11:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:20:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:25:39] (03CR) 10Mholloway: [C: 04-1] "After looking more closely, I think I would put the setting into CommonsSetting.php after all. We don't need to vary by wiki, which is th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) (owner: 10Eric Gardner) [19:27:20] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:27:22] (03CR) 10Mholloway: [C: 04-1] "Actually, in the first instance I would put the new setting after calling wfLoadExtension( 'MachineVision' ), but I am not sure it matters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) (owner: 10Eric Gardner) [19:30:54] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:35:08] (03PS2) 10Eric Gardner: Disable MachineVision email notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) [19:35:12] (03PS1) 10Herron: ganeti: update ganeti01.svc.ulsfo.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/558659 [19:35:16] (03PS2) 10Ottomata: Change eventgate-logging-external TLS port to 4392 [deployment-charts] - 10https://gerrit.wikimedia.org/r/558115 [19:36:20] (03CR) 10Ottomata: "Is there anything special (confctl) other than puppet that needs to happen to do this?" [puppet] - 10https://gerrit.wikimedia.org/r/558117 (owner: 10Ottomata) [19:36:22] (03CR) 10Herron: [C: 03+2] ganeti: update ganeti01.svc.ulsfo.wmnet certificate [puppet] - 10https://gerrit.wikimedia.org/r/558659 (owner: 10Herron) [19:37:51] (03CR) 10Mholloway: [C: 03+2] Disable MachineVision email notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) (owner: 10Eric Gardner) [19:38:05] jouncebot: now [19:38:05] No deployments scheduled for the next 4 hour(s) and 21 minute(s) [19:38:45] (03Merged) 10jenkins-bot: Disable MachineVision email notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558635 (https://phabricator.wikimedia.org/T240878) (owner: 10Eric Gardner) [19:39:48] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:41:07] (03PS1) 10Ottomata: Enable envoyproxy tls for schema.svc [puppet] - 10https://gerrit.wikimedia.org/r/558660 (https://phabricator.wikimedia.org/T233630) [19:41:43] !log mholloway-shell@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable MachineVision email notifications (T240878) (duration: 01m 07s) [19:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:52] T240878: Echo email notifications not bundling properly - https://phabricator.wikimedia.org/T240878 [19:49:44] !log Restarting CI Jenkins for plugins upgrades [19:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:57:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:09:36] (03PS1) 10CDanis: puppet-merge: close LOCKFD before checking ownership [puppet] - 10https://gerrit.wikimedia.org/r/558666 [20:11:21] (03PS1) 10BryanDavis: toolforge: Redirect / on urlproxy to /admin/ [puppet] - 10https://gerrit.wikimedia.org/r/558667 [20:17:08] (03CR) 10BryanDavis: [C: 03+1] "Tested manually on tools-proxy-06" [puppet] - 10https://gerrit.wikimedia.org/r/558667 (owner: 10BryanDavis) [20:18:22] (03CR) 10Bstorm: [C: 03+2] toolforge: Redirect / on urlproxy to /admin/ [puppet] - 10https://gerrit.wikimedia.org/r/558667 (owner: 10BryanDavis) [20:20:40] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/tag/toolforge/ [20:28:09] (03PS1) 10Bstorm: toolforge-shinken: the front page is now a redirect to the admin tool [puppet] - 10https://gerrit.wikimedia.org/r/558670 [20:28:21] (03PS5) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) [20:32:12] (03CR) 10Ammarpad: Add minerva custom log for la.wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [20:33:32] (03PS2) 10Bstorm: toolforge-shinken: the front page is now a redirect to the admin tool [puppet] - 10https://gerrit.wikimedia.org/r/558670 [20:34:26] (03CR) 10BryanDavis: [C: 03+1] toolforge-shinken: the front page is now a redirect to the admin tool [puppet] - 10https://gerrit.wikimedia.org/r/558670 (owner: 10Bstorm) [20:35:17] (03PS3) 10Bstorm: toolforge-shinken: the front page is now a redirect to the admin tool [puppet] - 10https://gerrit.wikimedia.org/r/558670 [20:36:31] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "This is 
showing verified but the field is empty, so I'm going to override..." [puppet] - 10https://gerrit.wikimedia.org/r/558670 (owner: 10Bstorm) [20:48:26] (03PS1) 10BryanDavis: toolforge: Update toolviews.py nginx log parser [puppet] - 10https://gerrit.wikimedia.org/r/558676 (https://phabricator.wikimedia.org/T238641) [20:54:18] (03CR) 10Jdlrobson: Add minerva custom log for la.wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [21:02:20] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/MachineVision: Add more info to MachineVisionEntitySaveException message, take 2 (duration: 01m 04s) [21:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:13] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/MachineVision: Add more info to MachineVisionEntitySaveException message, take 2 (duration: 01m 04s) [21:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:56] (03CR) 10Ammarpad: Add minerva custom log for la.wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [21:12:16] (03CR) 10Jdlrobson: Add minerva custom log for la.wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [21:12:51] (03PS1) 10EBernhardson: airflow: Enable kerberos configuration [puppet] - 10https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) [21:21:26] PROBLEM - nutcracker process on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [21:21:34] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:34] hm, I get "wikimedia error" currently, 503 backend fetch failed [21:22:02] PROBLEM - haproxy alive on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/HAProxy [21:22:04] PROBLEM - DPKG on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:22:54] Me too. 503 Backend fetch failed via cp3062 frontend [21:22:56] PROBLEM - MD RAID on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:22:58] PROBLEM - Disk space on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thumbor1003&var-datasource=eqiad+prometheus/ops [21:23:38] PROBLEM - puppet last run on thumbor1003 is CRITICAL: connect to address 10.64.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:24:46] RECOVERY - MD RAID on thumbor1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:24:46] RECOVERY - Disk space on thumbor1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thumbor1003&var-datasource=eqiad+prometheus/ops [21:25:00] RECOVERY - nutcracker process on thumbor1003 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [21:25:08] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:36] RECOVERY - haproxy alive on thumbor1003 is OK: OK check_alive uptime 6418516s https://wikitech.wikimedia.org/wiki/HAProxy [21:25:40] RECOVERY - DPKG on thumbor1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:26:30] (03PS6) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) [21:27:47] (03CR) 10Ammarpad: Add minerva custom log for la.wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [21:29:26] RECOVERY - puppet last run on thumbor1003 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:35:21] (03PS3) 10Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835) [21:35:23] (03PS3) 10Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) [21:35:25] (03PS1) 10Alexandros Kosiaris: Add a .gitmessage commit template [deployment-charts] - 10https://gerrit.wikimedia.org/r/558688 [21:35:27] (03PS1) 10Alexandros Kosiaris: RBAC: Add the tiller cluster role [deployment-charts] - 10https://gerrit.wikimedia.org/r/558689 (https://phabricator.wikimedia.org/T239835) [21:44:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but it should be split in 2 patches and the discovery records should be merged after LVS has been setup" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/556442 (owner: 10Herron) [21:52:10] still receiving reports from different people about those 503 errors [21:52:35] !log ✔️ 
cdanis@cp3050.esams.wmnet ~ 🕔🍵 sudo depool
[21:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:19] (CR) Alexandros Kosiaris: [C: -1] lvs: add entries for logstash-next and kibana-next (4 comments) [puppet] - https://gerrit.wikimedia.org/r/556443 (owner: Herron)
[21:59:06] Sagan: should be better now
[22:02:27] (CR) Alexandros Kosiaris: [C: +2] calico::cni: Pass datastore_type as well [puppet] - https://gerrit.wikimedia.org/r/558547 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[22:03:55] cdanis: thx for taking care. I will take a look if the problem occurs again, but for now it looks good :)
[22:09:15] (PS1) EBernhardson: scap: point discovery analytics at proper repository [puppet] - https://gerrit.wikimedia.org/r/558695
[22:09:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@100bf96]: Ship kerberos configuration for oozie jobs
[22:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:19] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@100bf96]: Ship kerberos configuration for oozie jobs (duration: 00m 34s)
[22:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:29] (CR) Alexandros Kosiaris: [C: +2] Add a .gitmessage commit template [deployment-charts] - https://gerrit.wikimedia.org/r/558688 (owner: Alexandros Kosiaris)
[22:13:43] (CR) Alexandros Kosiaris: [C: +2] RBAC: Add the tiller cluster role [deployment-charts] - https://gerrit.wikimedia.org/r/558689 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[22:13:45] (Merged) jenkins-bot: Add a .gitmessage commit template [deployment-charts] - https://gerrit.wikimedia.org/r/558688 (owner: Alexandros Kosiaris)
[22:13:59] (Merged) jenkins-bot: RBAC: Add the tiller cluster role [deployment-charts] - https://gerrit.wikimedia.org/r/558689 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[22:18:42] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[22:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:41] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[22:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:04] (PS4) Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835)
[22:30:06] (PS4) Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835)
[22:30:08] (PS1) Alexandros Kosiaris: Remove old redundant rbac/ dir [deployment-charts] - https://gerrit.wikimedia.org/r/558704
[22:30:10] (PS1) Alexandros Kosiaris: RBAC: Add system:nodes group to system:node [deployment-charts] - https://gerrit.wikimedia.org/r/558705 (https://phabricator.wikimedia.org/T239835)
[22:31:06] (PS1) Jhedden: add forward and reverse for cloudcephmgr.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/558707 (https://phabricator.wikimedia.org/T240715)
[22:39:15] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@0dd9f6b]: Ship kerberos configuration for oozie jobs
[22:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:06] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@0dd9f6b]: Ship kerberos configuration for oozie jobs (duration: 15m 50s)
[22:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:36] (PS1) BryanDavis: Preserve tool name and path info in k8s ingress rewrite [software/tools-webservice] - https://gerrit.wikimedia.org/r/558726 (https://phabricator.wikimedia.org/T241008)
[23:22:24] (CR) BryanDavis: Preserve tool name and path info in k8s ingress rewrite (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/558726 (https://phabricator.wikimedia.org/T241008) (owner: BryanDavis)
[23:41:07] (PS1) Cwhite: scb: add graphoid matching rules and deploy statsd exporter to scb cluster [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870)
[23:44:03] (CR) 20after4: [C: +1] Add a bit for forcing LC caching backend in cli mode [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: Ladsgroup)
[23:49:07] (CR) Cwhite: [C: +1] "That's a lot of errors, it's not unexpected though. This looks great! Thank you for tackling it!" [puppet] - https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: Jbond)
[23:50:50] (CR) Cwhite: [C: +1] "This CS looks like it did the right thing." [puppet] - https://gerrit.wikimedia.org/r/557061 (https://phabricator.wikimedia.org/T236954) (owner: Jbond)
[23:51:41] (PS1) Krinkle: varnish: Minor wording update for browsersec/sec-warning [puppet] - https://gerrit.wikimedia.org/r/558735 (https://phabricator.wikimedia.org/T238038)
[23:53:10] (CR) Ladsgroup: "I will do this tomorrow. Any objections?" [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: Ladsgroup)
[23:54:23] (PS3) EBernhardson: [cirrus] Disable Glent M0 A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[23:54:31] (PS3) EBernhardson: [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: DCausse)