[00:02:48] Operations, ops-codfw, ops-eqiad, netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4102901 (ayounsi) Slightly related. I disabled all the interfaces on the access switches that are down and don't have a description by adding them in `interfaces interface-...
[00:10:15] Dereckson: I'm sorry! I was in a meeting and missed your ping. Thanks for the patch. :)
[00:21:30] You're welcome.
[01:03:48] Operations, Phabricator: Phabricator is loading really slowly - https://phabricator.wikimedia.org/T191361#4102938 (greg)
[01:03:55] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4102940 (greg)
[01:09:48] (CR) Krinkle: "As I understand it, wfLoadExtension queues an extension to be loaded. It does not itself load an extension (this differs from the old days" [mediawiki-config] - https://gerrit.wikimedia.org/r/423809 (https://phabricator.wikimedia.org/T190353) (owner: Dereckson)
[01:20:46] Operations, netops: Enabling graceful-switchover causes core dumps on cr1-codfw - https://phabricator.wikimedia.org/T191371#4102950 (ayounsi)
[01:35:35] (PS1) Dzahn: installserver: cleanup roles including other roles, pt.1 [puppet] - https://gerrit.wikimedia.org/r/423821
[01:36:03] (CR) jerkins-bot: [V: -1] installserver: cleanup roles including other roles, pt.1 [puppet] - https://gerrit.wikimedia.org/r/423821 (owner: Dzahn)
[01:41:59] (PS2) Dzahn: installserver: cleanup roles including other roles, pt.1 [puppet] - https://gerrit.wikimedia.org/r/423821
[01:50:13] (PS1) Ayounsi: Add new SSH key for myself. [puppet] - https://gerrit.wikimedia.org/r/423827
[01:58:52] (CR) Dzahn: [C: 2] "noop on install and pop bastions (they require installserver::tftp)" [puppet] - https://gerrit.wikimedia.org/r/423821 (owner: Dzahn)
[02:01:24] (CR) Dzahn: [C: 2] "bastionhost::pop is also not affected because it already gets the same includes from bastionhost::general so the ones in the tftp class ar" [puppet] - https://gerrit.wikimedia.org/r/423821 (owner: Dzahn)
[02:12:46] (PS2) Dzahn: installserver: convert tftp role to profile [puppet] - https://gerrit.wikimedia.org/r/423787
[02:27:01] Operations, netops: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4103007 (ayounsi)
[02:27:10] Operations, netops: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4046514 (ayounsi) Open>Resolved
[02:32:19] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.27) (duration: 05m 31s)
[02:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:43] (PS2) Chad: Gerrit: symlink in motd.config from deploy repo [puppet] - https://gerrit.wikimedia.org/r/423735
[04:48:07] (PS1) Bstorm: wiki replicas: correct small but critical bugs in maintain-views [puppet] - https://gerrit.wikimedia.org/r/423833 (https://phabricator.wikimedia.org/T181650)
[05:14:17] (PS1) Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423834 (https://phabricator.wikimedia.org/T187089)
[05:16:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423834 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:17:31] (Merged) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423834 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:19:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 for alter table, kernel and mariadb upgrade (duration: 01m 17s)
[05:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:35] !log Stop mariadb for upgrade and kernel upgrade on db1072 - this will generate lag on s3 labs
[05:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:38] !log Drop click_tracking_events table from where it still exists - T115982
[05:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:44] T115982: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982
[05:48:21] !log Deploy schema change on db1072 - s3 - with replication. This will generate lag on labs T187089 T185128 T153182
[05:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:28] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:48:29] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:48:29] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:52:18] Operations, DBA, MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#4103159 (Marostegui)
[05:52:49] Operations, DBA, MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1737574 (Marostegui) Open>Resolved I have dropped the table everywhere where...
[05:54:24] (PS1) Marostegui: db-codfw.php: Depool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/423835 (https://phabricator.wikimedia.org/T188279)
[05:56:04] (PS1) Marostegui: db2083.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/423836 (https://phabricator.wikimedia.org/T188279)
[05:56:24] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/423835 (https://phabricator.wikimedia.org/T188279) (owner: Marostegui)
[05:56:50] (CR) Marostegui: [C: 2] db2083.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/423836 (https://phabricator.wikimedia.org/T188279) (owner: Marostegui)
[05:57:23] Operations, Ops-Access-Requests, Scoring-platform-team: Request for "administrator" rights on beta cluster - https://phabricator.wikimedia.org/T191356#4103169 (Hydriz) Open>Resolved a:Hydriz Done, administrator rights granted to User:Adamw@enwiki.
[05:58:19] (Merged) jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/423835 (https://phabricator.wikimedia.org/T188279) (owner: Marostegui)
[05:59:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db2083 - T188279 (duration: 01m 17s)
[06:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:00] T188279: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279
[06:20:36] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/423827 (owner: Ayounsi)
[06:24:36] (PS1) Elukey: prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460)
[06:31:00] (PS4) Muehlenhoff: Enable base::service_auto_restart for Parsoid services on ruthenium [puppet] - https://gerrit.wikimedia.org/r/423718 (https://phabricator.wikimedia.org/T135991)
[06:31:39] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for Parsoid services on ruthenium [puppet] - https://gerrit.wikimedia.org/r/423718 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:31:45] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf]
[06:46:45] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:46:50] (PS2) Elukey: prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460)
[07:00:09] (PS3) Elukey: [WIP] prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460)
[07:02:00] (PS1) Muehlenhoff: Update SSH key of Stas Malyshev [puppet] - https://gerrit.wikimedia.org/r/423841
[07:05:07] (PS4) Elukey: [WIP] prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460)
[07:11:13] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4103280 (hashar) Open>Resolved a:thcipriani @Ottomata is now listed as a member and project admin. I am assuming Tyler did it yest...
[07:18:48] (PS5) Elukey: prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460)
[07:23:46] (CR) Elukey: "So from pcc I am not able to see any change in prometheus* hosts, but afaics it should do what I'd need, namely override cluster if passed" [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460) (owner: Elukey)
[07:24:37] (PS1) Marostegui: db2055.yaml: Change its binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423842 (https://phabricator.wikimedia.org/T191275)
[07:25:33] (CR) Marostegui: [C: 2] db2055.yaml: Change its binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423842 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:26:10] !log Restart MySQL on db2055 to change its binlog to STATEMENT - T191275
[07:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:16] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[07:30:04] !log finish up cache@eqiad reboots for retpoline kernel updates T188092
[07:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:54] (PS1) Marostegui: db-codfw.php: db2055 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423843 (https://phabricator.wikimedia.org/T191275)
[07:33:41] (CR) Marostegui: [C: 2] db-codfw.php: db2055 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423843 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:35:00] (Merged) jenkins-bot: db-codfw.php: db2055 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423843 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[07:36:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2083 - T188279 (duration: 01m 17s)
[07:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:08] T188279: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279
[07:36:38] (Abandoned) Elukey: prometheus::jmx_exporter_config: allow 'cluster' override [puppet] - https://gerrit.wikimedia.org/r/423838 (https://phabricator.wikimedia.org/T177460) (owner: Elukey)
[07:37:29] !log running some apache/stretch tests on mw2261
[07:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:57] !log marostegui@tin Synchronized wmf-config/db-codfw.php: db2055 is now a candidate master - T191275 (duration: 01m 16s)
[07:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:03] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[07:53:04] !log Drop flaggedrevs from s3 mediawikiwiki - T186865
[07:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:10] T186865: Drop flaggedrevs tables from mediawikiwiki - https://phabricator.wikimedia.org/T186865
[07:53:22] Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (akosiaris) For what is worth, I don't like the idea of adding anything like that in `network::constants`. I don't even like the current `$special_hosts` construct (it has gotten out of ha...
[07:53:49] Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#4103414 (akosiaris) Open>declined I am gonna close this as declined. Feel free to reopen though.
[07:57:04] (PS1) Elukey: Add the -skipTrash option to hdfs -rm [puppet/cdh] - https://gerrit.wikimedia.org/r/423845 (https://phabricator.wikimedia.org/T189051)
[07:59:13] !log depool ms-fe2005 to test rewrite.py - T183902
[07:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:19] T183902: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902
[08:00:40] (PS1) Marostegui: db-eqiad.php: Slowly repool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423846
[08:02:42] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423846 (owner: Marostegui)
[08:03:57] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423846 (owner: Marostegui)
[08:05:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1072 (duration: 01m 17s)
[08:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:12] !log Deploy schema change on s3 primary master (db1075) - T153182 T185128
[08:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:18] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[08:08:19] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[08:08:59] (PS7) Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213)
[08:10:57] (PS1) Marostegui: db-eqiad.php: Restore original weight for db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423848
[08:17:53] !log start of ladsgroup@terbium:~$ mwscript deleteAutoPatrolLogs.php --wiki=mediawikiwiki --dry-run --check-old --before 20160423210426
[08:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:33] !log start of ladsgroup@terbium:~$ mwscript deleteAutoPatrolLogs.php --wiki=mediawikiwiki --check-old --before 20160423210426 (T184485)
[08:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:39] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485
[08:21:22] Operations, Traffic, Patch-For-Review: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545#4103454 (ema) Open>Resolved a:ema With https://gerrit.wikimedia.org/r/416652 being merged, this is done.
[08:22:51] Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968#4103463 (ema) Open>Resolved a:ema I haven't seen this happening anymore in the past 3 months. Closing for now but feel free to reopen should the issue come back.
[08:24:48] (PS1) Elukey: [WIP]: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - https://gerrit.wikimedia.org/r/423851
[08:25:34] (PS5) Ema: varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942)
[08:26:04] (CR) Ema: [C: 2] varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[08:27:52] Operations, Continuous-Integration-Infrastructure, Packaging, Release-Engineering-Team (Kanban), Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#4103472 (akosiaris) Open>Resolved a:akosiaris Done.
[08:28:42] (PS1) Filippo Giunchedi: swift: deprecate webob usage in rewrite.py [puppet] - https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902)
[08:34:50] (PS2) Ema: Stop routing Varnish thumb.php traffic to image scalers [puppet] - https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: Gilles)
[08:35:27] (CR) Ema: [C: 2] Stop routing Varnish thumb.php traffic to image scalers [puppet] - https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: Gilles)
[08:42:11] (PS2) Filippo Giunchedi: swift: deprecate webob usage in rewrite.py [puppet] - https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902)
[08:43:05] (CR) Marostegui: [C: 2] db-eqiad.php: Restore original weight for db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423848 (owner: Marostegui)
[08:44:19] (Merged) jenkins-bot: db-eqiad.php: Restore original weight for db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/423848 (owner: Marostegui)
[08:46:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1072 (duration: 01m 17s)
[08:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:00] (PS2) Elukey: [WIP]: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - https://gerrit.wikimedia.org/r/423851
[08:54:54] (PS1) Marostegui: db2041.yaml: Change binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423854 (https://phabricator.wikimedia.org/T191275)
[08:55:52] (CR) Marostegui: [C: 2] db2041.yaml: Change binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423854 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[08:56:49] (CR) Elukey: "pcc still not showing up everything:" [puppet] - https://gerrit.wikimedia.org/r/423851 (owner: Elukey)
[08:57:21] (PS1) Marostegui: db-codfw.php: Depool db2041 [mediawiki-config] - https://gerrit.wikimedia.org/r/423855
[08:58:40] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2041 [mediawiki-config] - https://gerrit.wikimedia.org/r/423855 (owner: Marostegui)
[08:59:22] (CR) Ema: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902) (owner: Filippo Giunchedi)
[08:59:52] (Merged) jenkins-bot: db-codfw.php: Depool db2041 [mediawiki-config] - https://gerrit.wikimedia.org/r/423855 (owner: Marostegui)
[09:01:28] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2041 (duration: 01m 17s)
[09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:28] !log Stop MySQL on db2041 for binlog format change and kernel upgrade
[09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:36] !log start of mwscript deleteAutoPatrolLogs.php --wiki=mediawikiwiki--before 20180223210426 --sleep 2 (T184485)
[09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:42] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485
[09:03:40] (PS1) Ema: varnishxcps: remove nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/423861 (https://phabricator.wikimedia.org/T184942)
[09:05:05] Operations, Puppet: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388#4103530 (Volans)
[09:10:55] (PS2) Ema: varnishxcps: post-removal cleanup [puppet] - https://gerrit.wikimedia.org/r/423861 (https://phabricator.wikimedia.org/T184942)
[09:16:23] !log executed systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service on kafka1020
[09:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:53] RECOVERY - Check systemd state on kafka1020 is OK: OK - running: The system is fully operational
[09:19:04] (PS1) Marostegui: db-codfw.php: db2041 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423862 (https://phabricator.wikimedia.org/T191275)
[09:21:09] (CR) Marostegui: [C: 2] db-codfw.php: db2041 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423862 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[09:22:21] (Merged) jenkins-bot: db-codfw.php: db2041 is now a candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/423862 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[09:23:59] (PS1) Marostegui: db-eqiad.php: Restore db1077 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/423863
[09:24:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: db2041 is now a candidate master for s2 - T191275 (duration: 01m 16s)
[09:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:21] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275
[09:25:17] !log end of the deleteAutoPatrolLogs.php script on mediawikiwiki (T184485)
[09:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:24] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485
[09:28:03] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1077 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/423863 (owner: Marostegui)
[09:29:16] (Merged) jenkins-bot: db-eqiad.php: Restore db1077 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/423863 (owner: Marostegui)
[09:30:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1077 (duration: 01m 16s)
[09:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:03] (PS1) Ema: cache_text: use eqiad debug appserver [puppet] - https://gerrit.wikimedia.org/r/423866
[09:46:20] (CR) Ema: [C: 2] varnishxcps: post-removal cleanup [puppet] - https://gerrit.wikimedia.org/r/423861 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[09:46:26] !log Deploy schema change on s1 codfw master db2048 (this will generate lag on codfw) - T187089 T185128 T153182
[09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:33] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[09:46:33] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[09:46:33] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[09:46:51] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team (Kanban): Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4103641 (MarcoAurelio) Resolved>Open Puppet issue still not resolved.
[09:47:52] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4103645 (MarcoAurelio) a:thcipriani>None
[09:48:29] PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:49:05] (PS2) Filippo Giunchedi: base: enable exporting SMART metrics by default [puppet] - https://gerrit.wikimedia.org/r/422112 (https://phabricator.wikimedia.org/T86552)
[09:50:04] (CR) Filippo Giunchedi: [C: 2] base: enable exporting SMART metrics by default [puppet] - https://gerrit.wikimedia.org/r/422112 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[09:51:29] RECOVERY - Check systemd state on mw2265 is OK: OK - running: The system is fully operational
[09:55:44] (PS1) Ladsgroup: mediawiki: remove old and finished wikidata cronjob that has been stopped [puppet] - https://gerrit.wikimedia.org/r/423870
[09:56:34] (PS1) Filippo Giunchedi: smart: fix apt::pin package definition [puppet] - https://gerrit.wikimedia.org/r/423871 (https://phabricator.wikimedia.org/T86552)
[09:57:41] (CR) Filippo Giunchedi: [C: 2] smart: fix apt::pin package definition [puppet] - https://gerrit.wikimedia.org/r/423871 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[09:57:43] (PS2) Filippo Giunchedi: smart: fix apt::pin package definition [puppet] - https://gerrit.wikimedia.org/r/423871 (https://phabricator.wikimedia.org/T86552)
[09:58:08] jynus: https://gerrit.wikimedia.org/r/423870 do you remember this?
[09:58:55] I remember the task, not the specific CR
[09:59:11] do you want me to deploy that?
[09:59:50] PROBLEM - Check systemd state on lvs1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:00:05] Amir1: if yes, the best way, rather than IRC, is to add me as reviewer and say on a comment "please deploy"
[10:00:21] checking lvs1002
[10:00:45] ema: that might be me merging https://gerrit.wikimedia.org/r/c/422112 btw, not sure
[10:00:52] that's smartd yes
[10:01:09] sigh, thanks I'll closer look
[10:01:18] take even
[10:01:55] thanks!
[10:02:12] Operations, Puppet: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393#4103691 (Volans) p:Triage>Normal
[10:02:19] Operations, Puppet: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388#4103702 (Volans) p:Triage>Normal
[10:02:41] PROBLEM - Check systemd state on lvs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:04:40] Operations, Puppet, Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4103744 (Volans)
[10:04:42] Operations, Puppet, Patch-For-Review: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918#4103741 (Volans) Open>Resolved a:Volans Reports are enabled since ~1 day without any incident. Resolving.
[10:04:50] I'll disable smart on lvs for now via hiera
[10:04:54] godog: need a hand for this?
[10:05:30] volans: not at this very moment I think, I'll let you know tho
[10:05:36] ack
[10:09:51] jynus: sorry, got pulled in a meeting, it was the script to rebuild wb_terms table and populate term_full_entity_id, it has been stopped so it doesn't need deploy
[10:09:57] just a clean up
[10:10:42] s/deploy/merge/
[10:11:20] PROBLEM - Check systemd state on lvs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:15:08] (PS1) Filippo Giunchedi: hieradata: exclude lvs100[1-6] with mpt controller from smart checks [puppet] - https://gerrit.wikimedia.org/r/423874 (https://phabricator.wikimedia.org/T86552)
[10:16:57] (PS2) Jcrespo: mediawiki: remove old and finished wikidata cronjob that has been stopped [puppet] - https://gerrit.wikimedia.org/r/423870 (owner: Ladsgroup)
[10:17:21] Amir1: what I mean is that if I don't get added and asked there, I won't be able to act on it
[10:17:23] Operations, Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#3888215 (faidon) In terms of code, what would the changes required be? What are these deprecation warnings that you mentioned above? Are we tracking fixes for these somewhere and are we making sure new ones don't crop up?
[10:17:40] PROBLEM - Check systemd state on lvs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:17:44] by adding me as reviewer, it will go to my queue of pending reviews
[10:18:09] and if you give me context "please merge", I know what you need from me
[10:18:18] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10790/" [puppet] - https://gerrit.wikimedia.org/r/423874 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[10:18:23] (CR) Filippo Giunchedi: [C: 2] hieradata: exclude lvs100[1-6] with mpt controller from smart checks [puppet] - https://gerrit.wikimedia.org/r/423874 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[10:18:27] (PS2) Filippo Giunchedi: hieradata: exclude lvs100[1-6] with mpt controller from smart checks [puppet] - https://gerrit.wikimedia.org/r/423874 (https://phabricator.wikimedia.org/T86552)
[10:18:53] (CR) Jcrespo: [C: 2] mediawiki: remove old and finished wikidata cronjob that has been stopped [puppet] - https://gerrit.wikimedia.org/r/423870 (owner: Ladsgroup)
[10:19:42] (PS3) Filippo Giunchedi: hieradata: exclude lvs100[1-6] with mpt controller from smart checks [puppet] - https://gerrit.wikimedia.org/r/423874 (https://phabricator.wikimedia.org/T86552)
[10:20:08] godog: we have a fact with the raid controller
[10:20:30] godog: if 'mpt' in $facts['raid'] {
[10:20:36] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4103875 (Marostegui) a:RobH>Cmjohnson Assigning to Chris to reflect the latest work that was done for this host
[10:20:50] PROBLEM - Check systemd state on lvs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:21:56] paravoid: indeed, I ran into that problem (mpt not detected) before in T179078
[10:21:57] T179078: mpt raid controller not detected as fact on maps-test2* - https://phabricator.wikimedia.org/T179078
[10:22:19] ah!
[10:23:37] (CR) Jcrespo: [C: 1] "Everthing works as intended. Manuel: Test and merge if you are ok with it. It may need later more work for improvements, but that could be" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419725 (owner: Rduran)
[10:23:39] at the time I thought it wasn't worth it to look into further, probably still true given that those are old lvs boxes
[10:24:09] yeah
[10:24:14] I thought it was the regular "mpt"
[10:25:23] jynus: of course, you're right
[10:25:29] my bad
[10:26:08] I am not saying anything, just asking that so you do not have to wait
[10:26:15] paravoid: *nod* yeah I remember being puzzled
[10:28:07] Operations, Puppet, Tracking: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388#4103895 (MarcoAurelio) > This is the main tracking task + #tracking
[10:30:42] (CR) Jcrespo: [C: -1] "Actually:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419725 (owner: Rduran)
[10:33:48] RECOVERY - Check systemd state on lvs1004 is OK: OK - running: The system is fully operational
[10:33:57] RECOVERY - Check systemd state on lvs1005 is OK: OK - running: The system is fully operational
[10:33:57] RECOVERY - Check systemd state on lvs1006 is OK: OK - running: The system is fully operational
[10:34:00] \o/
[10:34:07] RECOVERY - Check systemd state on lvs1002 is OK: OK - running: The system is fully operational
[10:34:28] RECOVERY - Check systemd state on lvs1001 is OK: OK - running: The system is fully operational
[10:36:42] Operations, monitoring, Patch-For-Review: mpt raid controller not detected as fact on maps-test2* - https://phabricator.wikimedia.org/T179078#4103922 (fgiunchedi) `lvs100[1-6]` are also affected by this, and similarly are due to be decom'd soon too.
[10:37:49] no criticals ATM on icinga \o/
[10:47:00] (CR) Jcrespo: [C: 1] "Ok, false positive, it just requires a .dblist suffix, it just works like the original." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419725 (owner: Rduran)
[10:51:36] (PS1) Marostegui: db2057.yaml: Change binlog format to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423879 (https://phabricator.wikimedia.org/T191275)
[10:54:11] (CR) Marostegui: [C: 2] db2057.yaml: Change binlog format to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/423879 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[10:54:34] Operations, DBA, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4103958 (jcrespo) @Cmjohnson and @robh Thanks for all the hard work on eqiad!- once all decommission steps happen (we can and should wait for it to finish, that is mor...
[10:57:29] Operations, Commons, MediaWiki-File-management, Multimedia, and 3 others: thumb.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#4103977 (ema)
[10:58:00] (PS9) Rduran: Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746
[10:58:02] (PS9) Rduran: Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340
[11:01:19] (PS1) Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based mediawiki setups [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195)
[11:01:23] (PS1) Marostegui: db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/423883 (https://phabricator.wikimedia.org/T191275)
[11:01:46] (CR) jerkins-bot: [V: -1] Disable PrivateTmp via systemd override for stretch-based mediawiki setups [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff)
[11:02:20] !log upgrade apertium on scb1001
[11:02:20] jouncebot: next
[11:02:20] In 1 hour(s) and 57 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1300)
[11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:04] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/423883 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[11:03:45] (PS2) Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based mediawiki setups [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195)
[11:04:11] (CR) jerkins-bot: [V: -1] Disable PrivateTmp via systemd override for stretch-based mediawiki setups [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff)
[11:04:19] (Merged) jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - https://gerrit.wikimedia.org/r/423883 (https://phabricator.wikimedia.org/T191275) (owner: Marostegui)
[11:05:10] kart_: scb1001 upgraded. let's see if anything breaks
[11:06:44] !log Stop MySQL on db2057 for binlog format change, mariadb and kernel upgrade
[11:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:49] akosiaris: okay
[11:11:20] kart_: I see nothing complaining. I think I should proceed with the rest of the hosts
[11:12:23] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-prep down hosts - fix/remove? - https://phabricator.wikimedia.org/T191293#4104010 (EddieGP) @MoritzMuehlenhoff: As I saw you commenting about deployment-videoscaler01 on T174477#3737836, do you know (or know who knows) whether...
[11:12:34] akosiaris: yes. Tested command line too.
[11:12:42] ok, proceeding then
[11:13:18] !log upgrade apertium on all scb hosts.
Rolling update with in groups of 2 hosts with a 30 seconds delay [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:47] akosiaris: apertium-apy restart is auto or manual in this case? [11:15:09] kart_: it's included in the upgrade [11:15:36] nice! [11:15:36] fyi the command is cumin -b 2 -p1 -s 30 'scb*' 'apt-get install -y apertium apertium-fra-cat apertium-spa-ita apertium-fra apertium-es-it apertium-cat ; service apertium-apy restart' [11:15:52] noted. thanks! [11:16:09] * volans stubs akosiaris for the semicolon [11:16:16] lol [11:16:18] I've not tested cumin flavor yet. [11:16:36] kart_: you can't in production, but I think you can in labs [11:16:38] *stabs ofc [11:17:01] volans: you can't teach an old dog new tricks :P [11:17:11] also, -p1, why? [11:17:44] ah just muscle memory from when I don't care about the output of the command [11:17:52] it's actually wrong in this case [11:17:56] yeaeh [11:18:05] I shouldn't have set it ... [11:18:11] :-P [11:18:18] thankfully no errors [11:18:35] kart_: done. [11:18:56] akosiaris: cool [11:18:59] -m async with two commands could have been more 'cumin-ish' :-P [11:20:15] akosiaris: you may want to go for Puppet patch.. [11:20:30] kart_: yeah just looking into icinga to make sure we got no errors [11:20:36] OK! [11:21:09] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-prep down hosts - fix/remove? - https://phabricator.wikimedia.org/T191293#4104033 (10MoritzMuehlenhoff) Yeah, I think both deployment-tmh01 and deployment-videoscaler01 can be deleted, they are not functional in deployment-prep a... 
[11:21:09] I see nothing though so I think we are good to go [11:21:11] (03PS1) 10Marostegui: db-codfw.php: db2057 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423887 (https://phabricator.wikimedia.org/T191275) [11:23:55] (03PS1) 10Jon Harald Søby: Add namespace to euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) [11:24:16] (03CR) 10Alexandros Kosiaris: [C: 032] apertium: Add apertium-separable package [puppet] - 10https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [11:24:21] (03PS4) 10Alexandros Kosiaris: apertium: Add apertium-separable package [puppet] - 10https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [11:24:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apertium: Add apertium-separable package [puppet] - 10https://gerrit.wikimedia.org/r/421833 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [11:24:55] kart_: done ^ [11:25:02] akosiaris: \0/ [11:25:06] Thanks a lot! [11:25:48] thanks as well [11:26:19] :) [11:33:37] akosiaris: we've problem it seems. [11:34:46] akosiaris: has apertium-separable installed on all hosts? [11:37:12] fra-cat translation returns: usage: ./lsx-comp [11:38:36] (03PS3) 10Elukey: [WIP]: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - 10https://gerrit.wikimedia.org/r/423851 [11:41:38] (03PS2) 10Jon Harald Søby: Add namespace to euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) [11:41:55] akosiaris: talking with upstream. Possible bug. 
[11:44:50] (03CR) 10Marostegui: [C: 032] db-codfw.php: db2057 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423887 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [11:46:08] (03Merged) 10jenkins-bot: db-codfw.php: db2057 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423887 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [11:47:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: db2057 is now a candidate master for s3 - T191275 (duration: 01m 17s) [11:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:58] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275 [11:52:37] kart_: yes it's installed everywhere [11:53:11] akosiaris: upstream bug! [11:59:12] (03PS3) 10Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based mediawiki setups [puppet] - 10https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) [12:00:21] !log revert scb hosts to apertium-fra-cat_1.2.0~r78602-1+wmf2 [12:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:28] kart_: ^ done. [12:00:33] let's see if that fixes the issue [12:01:54] akosiaris: checking.. [12:02:37] !log removing /srv/deployment/prometheus from restbase2001/1007 - T181728 [12:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] T181728: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728 [12:02:52] akosiaris: working again. Thanks! [12:02:57] ok [12:03:16] akosiaris: I'll schedule next upgrade once upstream is fixed. Sorry for trouble! 
[12:04:21] 10Operations, 10Goal, 10Patch-For-Review, 10User-Elukey, 10User-fgiunchedi: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728#4104085 (10elukey) After removing /srv/deployment/prometheus I don't see any trace of the jmx exporter jar containe... [12:04:47] ok, thanks [12:11:18] (03CR) 10Jon Harald Søby: "The deployer should also run the following script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [12:14:13] (03PS4) 10Elukey: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - 10https://gerrit.wikimedia.org/r/423851 [12:20:43] (03CR) 10Gilles: [C: 031] swift: deprecate webob usage in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902) (owner: 10Filippo Giunchedi) [12:24:02] (03PS4) 10Muehlenhoff: Disable PrivateTmp via systemd override for video scalers [puppet] - 10https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) [12:29:00] 10Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#4104136 (10Gilles) [12:35:04] (03CR) 10Ema: varnish: Remove varnishxcache python daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [12:45:08] (03PS4) 10Vgutierrez: varnish: Remove varnishxcache python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1300). 
[13:00:05] Jhs: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:44] I can SWAT today [13:00:48] I'm here! [13:01:27] (03PS2) 10Ema: cache_text: use eqiad debug appserver [puppet] - 10https://gerrit.wikimedia.org/r/423866 [13:02:19] (03CR) 10Ema: [C: 032] cache_text: use eqiad debug appserver [puppet] - 10https://gerrit.wikimedia.org/r/423866 (owner: 10Ema) [13:02:20] zeljkof, lemme know if/when you need anything from me :) [13:03:11] Jhs: I'll ping you in a few minutes when the patch is at mwdebug [13:03:28] (Y) [13:05:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4104375 (10Gehel) @Cmjohnson any news on those servers? Do you have an estimate on when you'll have time to work on them? Thanks! [13:05:38] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [13:07:08] (03CR) 10jenkins-bot: testwikis wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423777 (owner: 1020after4) [13:07:10] (03Merged) 10jenkins-bot: Add namespace to euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [13:08:18] !log upgrade smartmontools to -backports version after https://gerrit.wikimedia.org/r/c/423871/ [13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:42] Jhs: the patch is at mwdebug1002, please test and let me know if I can deploy [13:09:42] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#4104380 (10jcrespo) @MoritzMuehlenhoff We should have a full coverage of form on all db and proxy hosts, 
with the exception of dbproxy1010 and dbproxy1011 that it is managed with th... [13:09:45] Error [13:09:45] Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes. [13:09:45] See the error message at the bottom of this page for more information. [13:10:19] Jhs: ouch [13:11:08] Jhs: where do you see that? [13:11:16] i have no idea how my patch could do that though [13:11:37] zeljkof, on any page with Txikipedia: prefix, e.g. https://eu.wikipedia.org/wiki/Txikipedia:Cyrtodactylus_collegalensis [13:11:45] other pages work, but CSS is messed up [13:12:08] (03CR) 10jenkins-bot: group0 wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423792 (owner: 1020after4) [13:13:17] now also on pages in other namespaces :s [13:13:27] (03CR) 10jenkins-bot: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423047 (https://phabricator.wikimedia.org/T189906) (owner: 10Jdlrobson) [13:13:46] (03CR) 10jenkins-bot: Make a note about the loading order of GlobalPreferences and Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422642 (https://phabricator.wikimedia.org/T190353) (owner: 10Samwilson) [13:13:48] Jhs: I don't see anything strange :| [13:13:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423834 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [13:14:12] https://eu.wikipedia.org/wiki/Txikipedia:Cyrtodactylus_collegalensis?uselang=en says "There is currently no text in this page. You can search for this page title in other pages, search the related logs, or create this page." [13:14:17] (03CR) 10jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423835 (https://phabricator.wikimedia.org/T188279) (owner: 10Marostegui) [13:14:35] Jhs: should I revert? [13:15:21] zeljkof, could we maybe try on mwdebug1001 instead? 
[13:16:10] Jhs: I don't think that would change anything [13:16:34] I don't see anything broken, but if you do, I would suggest reverting and debugging later :) [13:16:46] ok :) [13:17:05] the bottom of the pages say: [13:17:06] Request from 77.40.167.80 via cp1053 cp1053, Varnish XID 246450362 [13:17:06] Error: 508, Loop Detected at Wed, 04 Apr 2018 13:16:33 GMT [13:17:28] hm, looking... [13:18:02] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1077 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423863 (owner: 10Marostegui) [13:18:51] (03CR) 10jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423883 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [13:19:23] (03CR) 10jenkins-bot: db-codfw.php: db2057 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423887 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [13:19:44] Jhs: anyway, do we agree to revert? and debug outside of swat window? 
[13:19:48] yeah, sure [13:19:57] Jhs: ok, reverting [13:20:03] (03CR) 10jenkins-bot: Add namespace to euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423888 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [13:20:10] I'm not sure what is wrong, but obviously something is [13:20:37] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423846 (owner: 10Marostegui) [13:21:06] Jhs: I got the error message now too [13:21:08] strange [13:21:09] (03CR) 10jenkins-bot: db-codfw.php: Depool db2041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423855 (owner: 10Marostegui) [13:21:28] (03CR) 10jenkins-bot: db-codfw.php: db2055 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423843 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [13:21:30] (03PS1) 10BBlack: AU: experiment with splitting WA [dns] - 10https://gerrit.wikimedia.org/r/423909 (https://phabricator.wikimedia.org/T189252) [13:21:40] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423848 (owner: 10Marostegui) [13:22:17] (03CR) 10jenkins-bot: db-codfw.php: db2041 is now a candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423862 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [13:22:19] (03CR) 10BBlack: [C: 032] AU: experiment with splitting WA [dns] - 10https://gerrit.wikimedia.org/r/423909 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [13:23:07] (03CR) 10Filippo Giunchedi: "I'm not a huge fan of introducing other cluster-based labels because I suspect it'll be confusing, though in this case I can't think of a " (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/423851 (owner: 10Elukey) [13:23:09] (03PS1) 10Zfilipin: Revert "Add namespace to euwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423911 [13:23:22] (03CR) 
10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423911 (owner: 10Zfilipin) [13:24:34] zeljkof, would that Varnish XID thing be able to tell us something? [13:24:43] Jhs: well, that's all for today, I guess, I'll check the page after the revert, you are free to start debugging :) [13:24:53] (03Merged) 10jenkins-bot: Revert "Add namespace to euwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423911 (owner: 10Zfilipin) [13:24:55] Jhs: probably to somebody, but not me :D [13:26:03] zeljkof, hehe. thanks [13:26:23] i have no idea where to start debugging though, the change is only 3 lines, and i can't see anything wrong :\ [13:26:28] Jhs: sorry, I am not familiar with varnish [13:26:41] sounds like a detergent to me ;) [13:26:46] agreed [13:27:32] (03CR) 10jenkins-bot: Revert "Add namespace to euwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423911 (owner: 10Zfilipin) [13:29:00] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:423911|Revert "Add namespace to euwiki" (T191396)]] (duration: 01m 14s) [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:08] T191396: Add extra namespace in Basque Wikipedia - https://phabricator.wikimedia.org/T191396 [13:29:19] (03PS5) 10Elukey: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - 10https://gerrit.wikimedia.org/r/423851 [13:32:30] zeljkof, this might be a really stupid question, but is it possible that someone left something at mwdebug1002 so it's not in sync with production? and that could have caused it? 
[13:32:37] !log EU SWAT finished [13:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:01] * Jhs has no idea how that stuff really works :) [13:33:23] Jhs: possible, but not likely, I have synced mwdebug1002, that should put it in a clean state, but I do not know enough about it, maybe there is a way to mess it up [13:33:28] (03PS6) 10Elukey: prometheus_jmx_exporter_config: fine grained selection of resources [puppet] - 10https://gerrit.wikimedia.org/r/423851 [13:36:36] (03CR) 10Elukey: "Thanks Filippo for the review, I should have fixed all your main concerns." [puppet] - 10https://gerrit.wikimedia.org/r/423851 (owner: 10Elukey) [13:38:15] (03PS1) 10Jcrespo: dbproxy: Allow to configure weights per server and increase raise [puppet] - 10https://gerrit.wikimedia.org/r/423914 [13:38:45] (03CR) 10jerkins-bot: [V: 04-1] dbproxy: Allow to configure weights per server and increase raise [puppet] - 10https://gerrit.wikimedia.org/r/423914 (owner: 10Jcrespo) [13:40:35] (03PS2) 10Jcrespo: dbproxy: Allow to configure weights per server and increase raise [puppet] - 10https://gerrit.wikimedia.org/r/423914 [13:42:12] (03PS3) 10Jcrespo: dbproxy: Allow to configure weights per server and increase raise [puppet] - 10https://gerrit.wikimedia.org/r/423914 [13:48:59] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#4104554 (10fgiunchedi) [13:49:57] (03CR) 10Jcrespo: [C: 032] dbproxy: Allow to configure weights per server and increase raise [puppet] - 10https://gerrit.wikimedia.org/r/423914 (owner: 10Jcrespo) [13:50:52] (03CR) 10Filippo Giunchedi: [C: 031] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/423851 (owner: 10Elukey) [13:51:53] (03CR) 10Marostegui: "nice! does this mean that to simply depool a labs host we just set weight to 0 instead of playing around with the IP as we used to do?" 
[puppet] - 10https://gerrit.wikimedia.org/r/423914 (owner: 10Jcrespo) [13:52:53] (03PS3) 10Filippo Giunchedi: swift: deprecate webob usage in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902) [13:52:55] (03CR) 10Jcrespo: "Not 100% sure about this..." [puppet] - 10https://gerrit.wikimedia.org/r/423494 (owner: 10Jcrespo) [13:54:03] (03CR) 10Filippo Giunchedi: [C: 032] swift: deprecate webob usage in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/423852 (https://phabricator.wikimedia.org/T183902) (owner: 10Filippo Giunchedi) [13:56:35] (03CR) 10Jcrespo: [C: 032] "> nice! does this mean that to simply depool a labs host we just set" [puppet] - 10https://gerrit.wikimedia.org/r/423914 (owner: 10Jcrespo) [13:56:36] !log rollout https://gerrit.wikimedia.org/r/c/423852 across ms-fe machines - T183902 [13:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] T183902: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902 [13:59:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) [13:59:33] !log Deploy schema change on dbstore1002:s1 - T187089 T185128 T153182 [13:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:40] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:59:40] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [13:59:40] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [14:00:28] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) 
[14:02:10] 10Operations, 10Traffic, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4104636 (10ema) To get a more up-to-date idea about the percentage of requests we get with AE:br, I've analyzed 30s of GET traffic on cp3033 and was surprised to find zero requests wit... [14:02:54] (03PS2) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) [14:04:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:06:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:08:14] !log uploaded HHVM 3.18.5+wmf6 to component/icu57 for jessie-wikimedia (updated build with the security fix for CVE-2018-6334) [14:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 for alter table (duration: 01m 16s) [14:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423926 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [14:09:54] !log Deploy schema change on db1099:3311 - T187089 T185128 T153182 [14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:02] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [14:10:02] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [14:10:02] T185128: Schema change to prepare for dropping 
archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [14:11:06] !log purge cron smart-data-dump from lvs100[1-6] [14:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:53] (03CR) 10Mobrovac: [C: 031] profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [14:14:12] (03Abandoned) 10ArielGlenn: download.wikimedia.org moved to misc-web [dns] - 10https://gerrit.wikimedia.org/r/120999 (owner: 10ArielGlenn) [14:15:31] !log updating deployment-prep to HHVM 3.18.5+wmf6 [14:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:50] (03PS8) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [14:17:22] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10798/restbase1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [14:17:40] (03Abandoned) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: 10ArielGlenn) [14:20:10] !log apply net.ipv4.tcp_tw_reuse=1 to restbase* via https://gerrit.wikimedia.org/r/#/c/421901 - T190213 [14:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:54] !log Running populateArchiveRevId.php on group0 wikis for T191307 [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:01] T191307: Run maintenance/populateArchiveRevId.php on all wikis - https://phabricator.wikimedia.org/T191307 [14:33:02] (03PS2) 10Ayounsi: Add new SSH key for myself. 
[puppet] - 10https://gerrit.wikimedia.org/r/423827 [14:40:19] (03CR) 10Ayounsi: [C: 032] Add new SSH key for myself. [puppet] - 10https://gerrit.wikimedia.org/r/423827 (owner: 10Ayounsi) [14:41:58] (03PS1) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [14:42:32] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [14:43:10] (03PS2) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [14:43:15] (03PS1) 10Madhuvishy: dumps: Move dumps.wikimedia real cert from dataset1001 to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/423933 (https://phabricator.wikimedia.org/T188646) [14:43:36] (03PS2) 10Madhuvishy: dumps: Move dumps.wikimedia real cert from dataset1001 to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/423933 (https://phabricator.wikimedia.org/T188646) [14:43:54] (03PS3) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [14:49:37] (03PS3) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [14:56:24] (03PS4) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [14:57:16] (03PS1) 10Jcrespo: dbstore1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/423936 (https://phabricator.wikimedia.org/T186596) [14:57:38] (03CR) 10Marostegui: [C: 031] dbstore1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/423936 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [14:57:42] (03PS2) 10Jcrespo: dbstore1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/423936 
(https://phabricator.wikimedia.org/T186596) [14:58:16] (03CR) 10Jcrespo: [C: 032] dbstore1001: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/423936 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [14:59:03] (03PS5) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [14:59:45] (03PS1) 10Filippo Giunchedi: swift: don't error on unknown HTTP status reason [puppet] - 10https://gerrit.wikimedia.org/r/423937 (https://phabricator.wikimedia.org/T183902) [15:00:55] (03PS1) 10Madhuvishy: Update dumps TTL to 1M in prep for CNAME switch to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423938 (https://phabricator.wikimedia.org/T188646) [15:03:09] (03PS2) 10Filippo Giunchedi: swift: don't error on unknown HTTP status reason [puppet] - 10https://gerrit.wikimedia.org/r/423937 (https://phabricator.wikimedia.org/T183902) [15:04:26] (03PS1) 10Jcrespo: mariadb: Prepare dbstore1001 for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/423942 (https://phabricator.wikimedia.org/T186596) [15:05:08] (03CR) 10Marostegui: [C: 031] "+1 but I guess we should do all the HW tests before the reimage to avoid doing the reimage twice, no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/423942 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [15:05:22] (03PS4) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [15:06:21] !log delete /srv/deployment/prometheus from restbase* as clean up step for T181728 [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:27] T181728: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728 [15:07:03] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare dbstore1001 for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/423942 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [15:07:18] 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4104947 (10Ottomata) a:03Ottomata [15:07:37] !log mobrovac@tin Started restart [restbase/deploy@f3a53b6]: Pick up the net.ipv4.tcp_tw_reuse flag change - T190213 [15:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:55] (03PS2) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) [15:08:08] (03PS1) 10Madhuvishy: dumps: Remove rsync mirror access for fi.muni.cz [puppet] - 10https://gerrit.wikimedia.org/r/423944 [15:08:16] (03PS3) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) [15:08:28] (03CR) 10Marostegui: "I have done a few tests on my lab and it is working finely." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [15:09:08] (03PS3) 10Filippo Giunchedi: swift: don't error on unknown HTTP status reason [puppet] - 10https://gerrit.wikimedia.org/r/423937 (https://phabricator.wikimedia.org/T183902) [15:09:14] (03CR) 10Filippo Giunchedi: [C: 032] swift: don't error on unknown HTTP status reason [puppet] - 10https://gerrit.wikimedia.org/r/423937 (https://phabricator.wikimedia.org/T183902) (owner: 10Filippo Giunchedi) [15:09:45] 10Operations, 10wikidiff2, 10Patch-For-Review, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4104961 (10MoritzMuehlenhoff) I've built a 1.6.0 package against the HHVM version currently running on beta (which is linked against th... [15:10:52] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#4104965 (10jcrespo) This is blocked on @Cmjohnson to have a gap for firmware and BIOS upgrade + RAID rebuild as asked here T18659... [15:10:54] (03PS1) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) [15:11:01] (03CR) 10Daimona Eaytoy: [C: 04-1] "Per reason explained in commit message." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [15:11:20] (03CR) 10Madhuvishy: [C: 032] Update dumps TTL to 1M in prep for CNAME switch to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423938 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [15:11:26] (03PS5) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [15:12:54] taking over tin for 2 mins (merging/syncing a no-op clean-up patch) [15:13:01] (03PS2) 10Mobrovac: Clean up the config for high traffic jobs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423710 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:14:05] !log Update ttl for dumps.wikimedia.org CNAME to 1M in prep for switchover to labstore1007 T188646 [15:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:11] T188646: Point dumps.wikimedia.org to labstore1006/7 - https://phabricator.wikimedia.org/T188646 [15:15:03] 10Operations, 10Commons, 10MediaWiki-Database, 10Multimedia, and 4 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#4104992 (10jcrespo) The latest errors are : "Deadlock found when trying to get lock; try restarting transaction". Maybe ther... [15:15:31] (03CR) 10Mobrovac: [C: 032] Clean up the config for high traffic jobs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423710 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:16:34] (03PS2) 10Madhuvishy: dumps: Remove rsync mirror access for fi.muni.cz [puppet] - 10https://gerrit.wikimedia.org/r/423944 [15:16:48] (03Merged) 10jenkins-bot: Clean up the config for high traffic jobs. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423710 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:17:47] (03PS3) 10Madhuvishy: dumps: Remove rsync mirror access for fi.muni.cz [puppet] - 10https://gerrit.wikimedia.org/r/423944 [15:18:23] (03CR) 10jenkins-bot: Clean up the config for high traffic jobs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423710 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:18:26] (03CR) 10Madhuvishy: [C: 032] dumps: Remove rsync mirror access for fi.muni.cz [puppet] - 10https://gerrit.wikimedia.org/r/423944 (owner: 10Madhuvishy) [15:19:59] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Clean up config for the rest of high-traffic jobs after the switch - T190327 (duration: 01m 16s) [15:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:05] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [15:21:43] 10Operations, 10DBA: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#4105044 (10jcrespo) a:03jcrespo I am going to start doing some tests onto es2002. 
[15:23:10] done [15:23:13] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#4105065 (10elukey) [15:23:15] 10Operations, 10Goal, 10Patch-For-Review, 10User-Elukey, 10User-fgiunchedi: Stop using jmx_exporter deployed via scap in favour of Debian package - https://phabricator.wikimedia.org/T181728#4105064 (10elukey) 05Open>03Resolved [15:23:43] elukey: \o/ [15:25:35] \o/ [15:25:44] (03PS2) 10Bstorm: wiki replicas: correct small but critical bugs in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/423833 (https://phabricator.wikimedia.org/T181650) [15:28:25] (03CR) 10Bstorm: [C: 032] wiki replicas: correct small but critical bugs in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/423833 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:32:08] (03PS1) 10Jcrespo: mariadb: Depool es2015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423949 (https://phabricator.wikimedia.org/T153440) [15:33:38] (03CR) 10Jcrespo: "FYI" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423949 (https://phabricator.wikimedia.org/T153440) (owner: 10Jcrespo) [15:33:52] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423949 (https://phabricator.wikimedia.org/T153440) (owner: 10Jcrespo) [15:35:05] (03Merged) 10jenkins-bot: mariadb: Depool es2015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423949 (https://phabricator.wikimedia.org/T153440) (owner: 10Jcrespo) [15:37:32] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2015 (duration: 01m 17s) [15:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:28] (03CR) 10jenkins-bot: mariadb: Depool es2015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423949 (https://phabricator.wikimedia.org/T153440) (owner: 
10Jcrespo) [15:44:06] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#4105174 (10fgiunchedi) 05Open>03Resolved This is now deployed, 501s did indeed disappear now, `rewrite.py` is using `swob` from swift... [15:44:09] !log starting backup from es2015 (will create lag) [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:44] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [15:50:44] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.377 second response time [15:53:44] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [15:54:44] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.013 second response time [15:55:04] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [16:00:05] stephanebisson, gehel, and RoanKattouw: #bothumor My software never has bugs. It just develops random features. Rise for Maps services. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:27] 10Operations, 10DBA, 10Patch-For-Review: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#4105257 (10jcrespo) [16:02:00] so I guess that the misc 5xx are due to graphite-labs right ? 
[16:02:37] yeah [16:03:04] yep [16:05:04] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [16:06:07] (03PS1) 10Imarlier: NavigationTiming: Move logic to the server side [puppet] - 10https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425) [16:11:08] (03PS1) 10BryanDavis: admin: Allow wmcs-roots access to role::labs::monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/423960 (https://phabricator.wikimedia.org/T162404) [16:15:49] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4105312 (10EddieGP) @Krenair created that instance according to openstack browser. Can you tell whether this instance is still neede... [16:16:01] 10Operations, 10Puppet, 10puppet-compiler: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438#4105315 (10herron) p:05Triage>03Normal [16:17:07] (03PS6) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [16:17:40] (03PS3) 10Madhuvishy: dumps: Move dumps.wikimedia real cert from dataset1001 to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/423933 (https://phabricator.wikimedia.org/T188646) [16:17:50] (03PS2) 10Imarlier: NavigationTiming: Move logic to the server side [puppet] - 10https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425) [16:17:55] 10Operations, 10Traffic, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4105329 (10ema) >>! 
In T137979#4104636, @ema wrote: > during that timeframe we only received AE:br requests for methods other than GET (OPTIONS, POST). That was due to the fact that va... [16:19:39] (03PS3) 10Imarlier: NavigationTiming: Move logic to the server side [puppet] - 10https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425) [16:20:01] (03CR) 10Ottomata: [C: 031] Add the -skipTrash option to hdfs -rm [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423845 (https://phabricator.wikimedia.org/T189051) (owner: 10Elukey) [16:21:28] (03CR) 10Madhuvishy: [C: 032] dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [16:21:46] (03PS6) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [16:22:47] !log Change CNAME for dumps.wikimedia.org to labstore1007 T188646 [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:53] T188646: Point dumps.wikimedia.org to labstore1006/7 - https://phabricator.wikimedia.org/T188646 [16:24:01] (03CR) 10Madhuvishy: [C: 032] dumps: Move dumps.wikimedia real cert from dataset1001 to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/423933 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [16:25:50] 10Operations, 10DBA, 10Patch-For-Review: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#4105354 (10jcrespo) @ayounsi We are performing this 5.4TB backup by saturating the es2015 and es2002 host lin... 
[16:26:02] !log Move cert for dumps.wikimedia.org to labstore1007 (do_acme: true) T188646 [16:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:22] 10Operations, 10Analytics, 10New-Readers, 10Traffic, and 2 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4105369 (10atgo) Partnerships has been looking for a contact at Opera. We reached out to someone yesterday who is OOO until next week. Will keep you updated. [16:28:11] (03PS7) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [16:33:05] (03PS1) 10Bstorm: wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) [16:33:39] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:40:06] (03PS2) 10Bstorm: wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) [16:40:47] 10Operations, 10Puppet, 10puppet-compiler: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438#4105446 (10herron) Took a stab at building a stretch compiler and ran into this when attempting to start etcd... ``` Apr 04 15:32:51 compiler-keith-stretch1 systemd[1]: Starting etcd...... [16:40:51] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:42:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. 
- https://phabricator.wikimedia.org/T186596#4105469 (10RobH) I started the idrac firmware update, it is taking 5+ minutes to update. when done, it should show version 2.52 f... [16:43:15] (03PS3) 10Bstorm: wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) [16:47:01] !log ppchelko@tin Started deploy [cpjobqueue/deploy@d4a84ae]: Support multi-topic rules T191238 [16:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:08] T191238: Add support for catch-all rule in ChangeProp - https://phabricator.wikimedia.org/T191238 [16:47:25] (03PS8) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [16:47:43] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@d4a84ae]: Support multi-topic rules T191238 (duration: 00m 42s) [16:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:08] !log dbstore1001 rebooting for bios firmware update [16:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:55] (03PS1) 10Andrew Bogott: designate: add mitaka designate.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/423977 [16:57:31] (03PS2) 10Andrew Bogott: designate: add mitaka designate.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/423977 [16:58:13] !log ppchelko@tin Started deploy [cpjobqueue/deploy@60a2292]: Revert: Support multi-topic rules [16:58:14] (03CR) 10Andrew Bogott: [C: 032] designate: add mitaka designate.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/423977 (owner: 10Andrew Bogott) [16:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:34] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@60a2292]: Revert: Support multi-topic rules (duration: 00m 21s) [16:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:45] 
(03PS1) 10Andrew Bogott: designate: add mitaka version of api-paste.ini.erb [puppet] - 10https://gerrit.wikimedia.org/r/423980 [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1700). [17:00:04] subbu and tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:14] o/ [17:00:16] (03CR) 10Andrew Bogott: [C: 032] designate: add mitaka version of api-paste.ini.erb [puppet] - 10https://gerrit.wikimedia.org/r/423980 (owner: 10Andrew Bogott) [17:01:30] o/ [17:01:55] my patch is not really testable before it's live [17:02:38] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4105567 (10mobrovac) [17:03:15] I can SWAT [17:03:33] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (10mobrovac) [17:04:32] (03PS3) 10Thcipriani: Enable RemexHtml on all wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 (https://phabricator.wikimedia.org/T188881) (owner: 10Subramanya Sastry) [17:04:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 (https://phabricator.wikimedia.org/T188881) (owner: 10Subramanya Sastry) [17:04:55] hrm zuul is looking a little...full [17:05:44] yeah [17:06:28] (03Merged) 10jenkins-bot: Enable RemexHtml on all wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 (https://phabricator.wikimedia.org/T188881) (owner: 10Subramanya Sastry) [17:06:51] queue priority ftw [17:07:46] subbu: I've pulled 
RemexHtml on all wikimedia wikis to mwdebug1002, check please [17:07:51] k [17:08:15] (03CR) 10jenkins-bot: Enable RemexHtml on all wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 (https://phabricator.wikimedia.org/T188881) (owner: 10Subramanya Sastry) [17:08:23] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4105607 (10EddieGP) [17:08:49] thcipriani, lgtm .. good to go if there are no errors in logs. [17:09:11] subbu: don't see anything in logs, going live [17:10:32] (03PS4) 10Bstorm: wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) [17:11:44] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:423491|Enable RemexHtml on all wikimedia wikis]] T188881 (duration: 01m 18s) [17:11:49] ^ subbu live now [17:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:52] T188881: Enable RemexHTML on all wikimedia chapters/user groups wikis - https://phabricator.wikimedia.org/T188881 [17:12:20] thanks. 
[17:12:28] (03PS3) 10Thcipriani: Enable RemexHtml on all wikiquotes except frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) (owner: 10Subramanya Sastry) [17:12:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) (owner: 10Subramanya Sastry) [17:13:50] (03CR) 1020after4: [C: 031] ""scap should control which is the baseline date against which to verify the error rate, not logstash_checker, that could be used by other " [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [17:14:11] (03Merged) 10jenkins-bot: Enable RemexHtml on all wikiquotes except frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) (owner: 10Subramanya Sastry) [17:15:04] subbu: ^ is live on mwdebug1002, check please [17:15:10] k [17:15:22] (03PS9) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [17:16:15] (03Abandoned) 10Thcipriani: Scap canary: cache last good deploy time [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [17:16:16] thcipriani, lgtm. [17:16:35] subbu: okie doke, going live [17:17:03] 650 done, 250 more to go. the end is in sight. 
[17:18:05] :) [17:18:25] (03PS1) 10Andrew Bogott: designate: add a bunch of mitaka-version files [puppet] - 10https://gerrit.wikimedia.org/r/423985 [17:18:34] (03CR) 10jenkins-bot: Enable RemexHtml on all wikiquotes except frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) (owner: 10Subramanya Sastry) [17:18:58] (03CR) 10Andrew Bogott: [C: 032] designate: add a bunch of mitaka-version files [puppet] - 10https://gerrit.wikimedia.org/r/423985 (owner: 10Andrew Bogott) [17:19:06] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:423492|Enable RemexHtml on all wikiquotes except frwikiquote]] T190726 (duration: 01m 17s) [17:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:12] T190726: Enable RemexHTML on all wikiquotes (except frwikiquote) - https://phabricator.wikimedia.org/T190726 [17:19:31] ^ subbu all live, congrats :) [17:19:43] \o/ ty. [17:20:03] (03PS2) 10Thcipriani: Enable TemplateStyles on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) (owner: 10Gergő Tisza) [17:20:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) (owner: 10Gergő Tisza) [17:21:09] tgr: no way to test ^ on mwdebug? 
[17:21:59] (03Merged) 10jenkins-bot: Enable TemplateStyles on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) (owner: 10Gergő Tisza) [17:22:01] (03CR) 10Bstorm: [C: 032] wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [17:22:10] (03PS5) 10Bstorm: wiki replicas: fix complex where with regex substitution [puppet] - 10https://gerrit.wikimedia.org/r/423965 (https://phabricator.wikimedia.org/T181650) [17:22:49] thcipriani: can't, need to write data to test [17:22:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#4105730 (10RobH) I neglected to note the old bios and drac versions, but they are now latest versions each and done. h710 mini co... [17:23:22] tgr: ok, I'll push it live then [17:23:33] (03CR) 10jenkins-bot: Enable TemplateStyles on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) (owner: 10Gergő Tisza) [17:23:40] (03PS10) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [17:24:05] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#4105743 (10RobH) @jcrespo: Did you want to handle the raid rebuild? I'm not exactly sure what you want? (Just to wipe it all out... 
[17:26:07] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:422414]] Enable TemplateStyles on dewiki T190910 (duration: 01m 17s) [17:26:11] ^ tgr live now [17:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:15] T190910: Create and deploy configuration change to enable TemplateStyles on German Wikipedia on 2018-04-04 - https://phabricator.wikimedia.org/T190910 [17:28:56] 10Operations, 10Graphite, 10Services (watching): Cassandra Graphite metrics space usage audit and cleanup - https://phabricator.wikimedia.org/T191315#4105780 (10mobrovac) > It seems to me that with the new restbase storage and the move to Prometheus we don't need to refresh Cassandra-dedicated Graphite hardw... [17:32:05] (03PS11) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [17:37:40] (03PS1) 10Mark Bergsma: Don't use deprecated TestCase methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/423994 [17:37:42] (03PS1) 10Mark Bergsma: Create FSM test cases according to the RFC 4271 definition [debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 [17:38:58] (03PS1) 10Mark Bergsma: Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - 10https://gerrit.wikimedia.org/r/423997 [17:39:00] (03PS1) 10Mark Bergsma: Fix sendNotification invocation [debs/pybal] - 10https://gerrit.wikimedia.org/r/423998 [17:39:02] (03PS1) 10Mark Bergsma: Fix two typos in bgp.FSM.openReceived [debs/pybal] - 10https://gerrit.wikimedia.org/r/423999 [17:39:04] (03PS1) 10Mark Bergsma: Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - 10https://gerrit.wikimedia.org/r/424000 [17:39:34] (03PS1) 10Mark Bergsma: Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424001 [17:39:36] (03PS1) 10Mark Bergsma: Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424002 
[17:39:38] (03PS1) 10Mark Bergsma: Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424003 [17:39:40] (03PS1) 10Mark Bergsma: Handle state OPENSENT in keepAliveReceived [debs/pybal] - 10https://gerrit.wikimedia.org/r/424004 [17:39:42] (03PS1) 10Mark Bergsma: Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424005 [17:39:44] (03PS1) 10Mark Bergsma: Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - 10https://gerrit.wikimedia.org/r/424006 [17:39:46] (03PS1) 10Mark Bergsma: Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - 10https://gerrit.wikimedia.org/r/424007 [17:39:48] (03PS1) 10Mark Bergsma: Ignore headerError and openMessageError in state IDLE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424008 [17:39:50] (03PS1) 10Mark Bergsma: Cleanup module for consistency [debs/pybal] - 10https://gerrit.wikimedia.org/r/424009 [17:39:52] (03PS1) 10Mark Bergsma: Fix test case ESTABLISHED event 27 hold time nonzero [debs/pybal] - 10https://gerrit.wikimedia.org/r/424010 [17:39:54] (03PS1) 10Mark Bergsma: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - 10https://gerrit.wikimedia.org/r/424011 [17:40:06] (03PS12) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [17:40:29] (03PS1) 10Bstorm: wiki replicas: Fix selection regex on where clause [puppet] - 10https://gerrit.wikimedia.org/r/424012 [17:42:26] (03CR) 10Bstorm: [C: 032] wiki replicas: Fix selection regex on where clause [puppet] - 10https://gerrit.wikimedia.org/r/424012 (owner: 10Bstorm) [17:48:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4105855 (10Cmjohnson) @ayounsi I added wdqs1007 to ge-1/0/2 [17:49:56] thcipriani: doesn't seem to work [17:50:11] tgr: oh? should I rollback? 
[17:50:16] doesn't break anything either, so probably best to leave it on for debugging [17:50:45] the extension is listed in Special:Version but seems to have no actual effect [17:50:52] so it's not a deployment problem [17:51:16] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4007597 (10Cmjohnson) [17:51:24] well, that's not true, parts of it work as expected [17:51:41] is there a maintenance script for changing content model? [17:52:02] staff group should really have that right [17:52:07] (03PS1) 10Madhuvishy: Reset ttl for dumps.wikimedia.org to 1H post switchover [dns] - 10https://gerrit.wikimedia.org/r/424014 (https://phabricator.wikimedia.org/T188646) [17:52:08] I'm not aware of one [17:53:49] (03CR) 10Madhuvishy: [C: 032] Reset ttl for dumps.wikimedia.org to 1H post switchover [dns] - 10https://gerrit.wikimedia.org/r/424014 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [17:54:42] !log Reset ttl for dumps.wikimedia.org CNAME to 1H post switchover to labstore1007 T188646 [17:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:48] T188646: Point dumps.wikimedia.org to labstore1006/7 - https://phabricator.wikimedia.org/T188646 [17:55:08] Hi jynus - forgot to ping [17:55:17] (03PS1) 10RobH: updating labvirt1022 mac address [puppet] - 10https://gerrit.wikimedia.org/r/424016 (https://phabricator.wikimedia.org/T183937) [17:55:32] jynus: We've started sqooping the tables that failed on the 2nd [17:56:31] (03CR) 10RobH: [C: 032] updating labvirt1022 mac address [puppet] - 10https://gerrit.wikimedia.org/r/424016 (https://phabricator.wikimedia.org/T183937) (owner: 10RobH) [17:57:21] (03PS13) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [17:57:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Set jmx exporter instance common labels at host level [puppet] - 
10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [17:59:12] (03PS14) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [18:00:04] mutante: #bothumor I � Unicode. All rise for Deployment server swap deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:01:19] (03PS15) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [18:01:57] !log ppchelko@tin Started deploy [cpjobqueue/deploy@0185e74]: Fix the metric names and support multi-topic rules [18:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:10] (03PS1) 10Cmjohnson: dhcpd entries for wdqs1006-8 [puppet] - 10https://gerrit.wikimedia.org/r/424019 (https://phabricator.wikimedia.org/T188432) [18:02:32] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@0185e74]: Fix the metric names and support multi-topic rules (duration: 00m 35s) [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:03] (03CR) 10Cmjohnson: [C: 032] dhcpd entries for wdqs1006-8 [puppet] - 10https://gerrit.wikimedia.org/r/424019 (https://phabricator.wikimedia.org/T188432) (owner: 10Cmjohnson) [18:04:17] (03PS16) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [18:04:27] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4105906 (10Cmjohnson) a:05Cmjohnson>03RobH @robh assigning to you...everything but netboot.cfg is finished. 
[18:07:35] (03PS17) 10Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [18:08:37] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Request for one additional RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105919 (10dduvall) [18:10:20] (03PS1) 10Bstorm: wiki replicas: regex-add database name in joins for WHERE statements [puppet] - 10https://gerrit.wikimedia.org/r/424023 (https://phabricator.wikimedia.org/T181650) [18:12:58] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4105938 (10madhuvishy) 05Open>03Resolved [18:17:19] !log ppchelko@tin Started deploy [cpjobqueue/deploy@0125bc4]: Fixed the new metrics names. Again [18:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:55] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@0125bc4]: Fixed the new metrics names. Again (duration: 00m 37s) [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:23] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3867914 (10RobH) Ok, these are ready for service implementation. Handing off to @chasemp. labvirt1021 has puppet signed but wont run (kernel version issue for some puppet packages for use in clo... [18:20:22] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4105962 (10Cmjohnson) Quick update...this is turning into a giant time suck....HP does not know what the problem is, now they're asking for pictures of the cabling... [18:20:32] gehel: Thanks for fixing Cassandra on deployment-maps01! 
It keeps breaking like that and I don't understand why (see phab comment)
[18:20:58] (CR) Bstorm: [C: 2] wiki replicas: regex-add database name in joins for WHERE statements [puppet] - https://gerrit.wikimedia.org/r/424023 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[18:26:03] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Request for one additional RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105973 (RobH)
[18:26:43] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Request for one additional RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105919 (RobH)
[18:27:08] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105919 (RobH)
[18:30:44] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105919 (demon) I authorize this as @greg's delegate while he's on vacation.
[18:31:00] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4105983 (RobH) I've updated the task description a bit, and included the checklist we require. I'll also note...
[18:31:14] im not quite sure what group is needed on that ^?
[18:31:19] he is already in contint-admins
[18:31:24] which as all kinds of rights
[18:32:53] nm, found it
[18:33:04] task updated, needs meeting review.
[18:33:26] Yeah, it's contint-roots, right?
[18:33:27] Or something?
[18:33:38] yeah
[18:33:43] with only hashar in it, heh
[18:33:58] yep, that's the one.
[18:34:10] thcipriani: cool, seems legit to me but not my call
[18:34:12] yep yep
[18:34:18] i'll support when listed in monday's meeting =]
[18:34:31] robh: cool, thanks :)
[18:39:00] (PS1) Andrew Bogott: designate: s/domain/zone/ throughout [puppet] - https://gerrit.wikimedia.org/r/424030 (https://phabricator.wikimedia.org/T187954)
[18:41:38] (CR) Andrew Bogott: [C: 2] designate: s/domain/zone/ throughout [puppet] - https://gerrit.wikimedia.org/r/424030 (https://phabricator.wikimedia.org/T187954) (owner: Andrew Bogott)
[18:45:18] (PS1) Andrew Bogott: designate: add a pool config for labtest [puppet] - https://gerrit.wikimedia.org/r/424031 (https://phabricator.wikimedia.org/T187954)
[18:50:46] Operations, ops-codfw, Traffic: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4106021 (Papaul) Your Service Request SR#: 963059814 Contact Us | Support Library | Download Center | SupportAssist | Community Forums Dear Papaul Tshibamba, Current Status: This e-mail serves as...
[18:51:10] (CR) Andrew Bogott: [C: 2] designate: add a pool config for labtest [puppet] - https://gerrit.wikimedia.org/r/424031 (https://phabricator.wikimedia.org/T187954) (owner: Andrew Bogott)
[18:55:50] Operations, ops-codfw, Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4106039 (Papaul) Your Service Request SR#: 963061588 Contact Us | Support Library | Download Center | SupportAssist | Community Forums Dear PAPAUL TSHIBAMBA, Current Status: This e-mail serves as...
[18:56:08] Operations, ops-codfw, Traffic: cp2017 memory replacement - https://phabricator.wikimedia.org/T191227#4106049 (Papaul) Your Service Request SR#: 963052179 Contact Us | Support Library | Download Center | SupportAssist | Community Forums Dear Papaul Tshibamba, Current Status: This e-mail serves as...
[18:56:10] (CR) jerkins-bot: [V: -1] Deploy TemplateStyles to frwiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022) (owner: Zoranzoki21)
[18:57:15] Operations, ops-codfw, Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4106059 (Papaul) Your Service Request SR#: 963052308 Contact Us | Support Library | Download Center | SupportAssist | Community Forums Dear Papaul Tshibamba, Current Status: This e-mail serves as...
[18:57:58] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4106061 (Cmjohnson) Open>Resolved a:Cmjohnson
[18:58:17] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3857747 (Cmjohnson) Resolved>Open
[18:59:05] Operations, Community-Tech, MediaWiki-extensions-PageAssessments, Wikimedia-General-or-Unknown, Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#4106078 (MusikAnimal)
[18:59:09] Operations, Community-Tech, MediaWiki-extensions-PageAssessments: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#4106082 (MusikAnimal)
[19:00:04] twentyafterfour: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1900).
[19:00:05] No GERRIT patches in the queue for this window AFAICS.
[19:00:45] Operations, MediaWiki-extensions-PageAssessments: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#4106091 (MusikAnimal)
[19:01:06] Operations, MediaWiki-extensions-PageAssessments, Wikimedia-General-or-Unknown, Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#4106097 (MusikAnimal)
[19:02:58] Operations: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4106130 (Cmjohnson)
[19:03:34] !log upgraded blubbler 0.2.0-1 -> 0.3.0-1 on contint1001 and contint2001
[19:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:02] (PS4) Zoranzoki21: Deploy TemplateStyles to frwiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022)
[19:04:09] Operations, Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#4106155 (herron) >>! In T184564#4103864, @faidon wrote: > In terms of code, what would the changes required be? I don't have a list of code changes offhand. Going through the process of compiling/diffing all hosts aga...
[19:04:10] (CR) jerkins-bot: [V: -1] Deploy TemplateStyles to frwiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022) (owner: Zoranzoki21)
[19:05:05] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4106160 (Cmjohnson)
[19:05:16] Operations, DBA, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4106165 (Cmjohnson)
[19:05:19] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4106164 (Cmjohnson)
[19:05:22] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4052908 (Cmjohnson) Open>Resolved
[19:06:54] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#3676074 (Cmjohnson) @robh any status update on this?
[19:09:24] (PS1) 20after4: group1 wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - https://gerrit.wikimedia.org/r/424032
[19:09:26] (CR) 20after4: [C: 2] group1 wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - https://gerrit.wikimedia.org/r/424032 (owner: 20after4)
[19:10:53] (Merged) jenkins-bot: group1 wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - https://gerrit.wikimedia.org/r/424032 (owner: 20after4)
[19:11:09] (CR) jenkins-bot: group1 wikis to 1.31.0-wmf.28 refs T183967 [mediawiki-config] - https://gerrit.wikimedia.org/r/424032 (owner: 20after4)
[19:13:11] (PS1) Cmjohnson: Removing dns fro decom indium [dns] - https://gerrit.wikimedia.org/r/424033 (https://phabricator.wikimedia.org/T165345)
[19:18:05] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.28 refs T183967
[19:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:23] T183967: 1.31.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T183967
[19:19:22] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.28 refs T183967 (duration: 01m 16s)
[19:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:53] Operations, ops-eqiad, Patch-For-Review: decommission indium - https://phabricator.wikimedia.org/T165345#4106267 (Cmjohnson)
[19:21:07] Operations, ops-eqiad, Patch-For-Review: decommission indium - https://phabricator.wikimedia.org/T165345#3263693 (Cmjohnson) Open>Resolved a:Cmjohnson
[19:21:44] (Abandoned) Mobrovac: pdfrender: Tell SystemD to log directly into a file [puppet] - https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) (owner: Mobrovac)
[19:26:11] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4106350
(mobrovac)
[19:26:16] Operations, Analytics, ChangeProp, EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4106348 (mobrovac)
[19:26:57] Operations, Epic, Goal, Services (done), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#4106354 (mobrovac)
[19:27:01] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (mobrovac)
[19:28:07] Operations, Analytics, ChangeProp, EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#4106358 (mobrovac)
[19:28:12] Operations, Analytics, ChangeProp, EventBus, and 5 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#4106360 (mobrovac)
[19:29:52] (PS2) Cmjohnson: Removing dns fro decom indium [dns] - https://gerrit.wikimedia.org/r/424033 (https://phabricator.wikimedia.org/T165345)
[19:31:35] jouncebot: now
[19:31:36] For the next 1 hour(s) and 28 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T1900)
[19:32:32] Operations, Ops-Access-Requests, Scoring-platform-team: Request for "administrator" rights on beta cluster - https://phabricator.wikimedia.org/T191356#4106384 (awight) Thank you, I was able to confirm suppression works in my edge case.
[19:34:07] (CR) Cmjohnson: [C: 2] Removing dns fro decom indium [dns] - https://gerrit.wikimedia.org/r/424033 (https://phabricator.wikimedia.org/T165345) (owner: Cmjohnson)
[19:35:56] jouncebot: next
[19:35:56] In 0 hour(s) and 24 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T2000)
[19:38:09] Operations, Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4106393 (herron) Hi @Jonas, could you please elaborate on why this access is required, and who is vouching for it? CC: @nuria and Analytics for review. Thanks in...
[19:38:39] cscott: are you going to use your deployment window for parsoid?
[19:47:53] subbu: do you know about deployments ^ ? is https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday,_Apr._2,_2018_around_1:15_pm_PT:_d887aff_to_be_deployed still open from Monday?
[19:48:20] mutante, yes .. we haven't deployed that yet.
[19:50:27] (PS1) Madhuvishy: Add ipv6 addresses for labstore1006 and 7 [dns] - https://gerrit.wikimedia.org/r/424044 (https://phabricator.wikimedia.org/T188646)
[19:51:04] mutante, so, yes .. arlolra is going to use the slot for deployment.
[19:51:10] why do you ask?
[19:51:11] subbu: are you planning to deploy that in the upcoming window
[19:51:22] i want to know if you are using the window today
[19:51:27] yes.
[19:51:29] we are.
[19:51:29] not because of a specific patch
[19:51:38] ok, thanks!
[19:51:43] that's already all i needed
[19:52:00] that means i will not touch the deployment server before you did that
[19:52:49] (PS18) Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - https://gerrit.wikimedia.org/r/423931
[19:53:00] (CR) Madhuvishy: [C: 2] Add ipv6 addresses for labstore1006 and 7 [dns] - https://gerrit.wikimedia.org/r/424044 (https://phabricator.wikimedia.org/T188646) (owner: Madhuvishy)
[19:57:09] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#4106432 (RobH) Do these need to take priority over other decoms in the backlog?
[19:57:47] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#4106434 (RobH)
[19:58:02] (PS19) Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - https://gerrit.wikimedia.org/r/423931
[19:59:25] Operations, Analytics, ChangeProp, EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4106442 (dcausse)
[20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T2000).
[20:00:04] No GERRIT patches in the queue for this window AFAICS.
[20:02:07] akosiaris: Ready for a re-review, https://gerrit.wikimedia.org/r/#/c/423752/
[20:03:46] oops, that’s #relengy
[20:05:22] subbu: ^ it thinks there are no patches in window
[20:06:20] mutante, ya, it says that always for some reason.
[20:06:25] !log arlolra@tin Started deploy [parsoid/deploy@a8e759f]: Updating Parsoid to d887aff
[20:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:39] subbu: alright, because it's a separate calendar i guess
[20:08:24] (PS1) RobH: updating for wdqs100[678] [puppet] - https://gerrit.wikimedia.org/r/424049 (https://phabricator.wikimedia.org/T188432)
[20:09:24] (PS20) Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - https://gerrit.wikimedia.org/r/423931
[20:09:28] (CR) RobH: [C: 2] updating for wdqs100[678] [puppet] - https://gerrit.wikimedia.org/r/424049 (https://phabricator.wikimedia.org/T188432) (owner: RobH)
[20:15:37] !log mholloway-shell@tin Started deploy [mobileapps/deploy@940bd48]: Update mobileapps to 58a0a88
[20:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:56] (PS21) Ottomata: [WIP] Set jmx exporter instance common labels at host level [puppet] - https://gerrit.wikimedia.org/r/423931
[20:16:38] !log mholloway-shell@tin Started deploy [mobileapps/deploy@0460519]: Update mobileapps to 2d5ab5b
[20:16:39] (CR) jerkins-bot: [V: -1] [WIP] Set jmx exporter instance common labels at host level [puppet] - https://gerrit.wikimedia.org/r/423931 (owner: Ottomata)
[20:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:29] mdholloway: did you cancel the other deploy?
[20:17:31] Operations, ops-eqiad: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4106523 (RobH) p:Triage>Normal
[20:17:37] bearND: yes
[20:17:42] good
[20:18:22] !log arlolra@tin Finished deploy [parsoid/deploy@a8e759f]: Updating Parsoid to d887aff (duration: 11m 58s)
[20:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:36] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@0460519]: Update mobileapps to 2d5ab5b (duration: 05m 58s)
[20:22:37] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.28/skins/MonoBook: sync https://gerrit.wikimedia.org/r/#/c/424041/ (duration: 01m 16s)
[20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:00] (PS1) Madhuvishy: Add PTR records for labstore1006|7 [dns] - https://gerrit.wikimedia.org/r/424063 (https://phabricator.wikimedia.org/T188646)
[20:24:47] (CR) Madhuvishy: [C: 2] Add PTR records for labstore1006|7 [dns] - https://gerrit.wikimedia.org/r/424063 (https://phabricator.wikimedia.org/T188646) (owner: Madhuvishy)
[20:27:01] !log Updated Parsoid to d887aff (T177102, T189474)
[20:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:09] T177102: Please consider inserting a blank line before and after lists, because it's pretty - https://phabricator.wikimedia.org/T177102
[20:27:09] T189474: Image captions should ignore paragraph tags - https://phabricator.wikimedia.org/T189474
[20:27:49] Operations, Puppet: Retire nitrogen and nihal ganeti VMs - https://phabricator.wikimedia.org/T191467#4106564 (herron) p:Triage>Normal
[20:28:50] (PS1) Herron: remove nitrogen and nihal from site.pp and install_server [puppet] - https://gerrit.wikimedia.org/r/424064 (https://phabricator.wikimedia.org/T191467)
[20:29:09] (PS1) Jon Harald Søby: Add namespace for euwiki
[mediawiki-config] - https://gerrit.wikimedia.org/r/424065 (https://phabricator.wikimedia.org/T191396)
[20:31:44] (PS1) Herron: remove nitrogen and nihal from forward/reverse dns [dns] - https://gerrit.wikimedia.org/r/424067 (https://phabricator.wikimedia.org/T191467)
[20:38:16] Operations: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4106633 (RobH) a:RobH>Gehel
[20:38:29] Operations: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4007597 (RobH) Assigned to @gehel as these are now ready for service implementation.
[20:41:35] Operations, Puppet, Patch-For-Review: Retire nitrogen and nihal ganeti VMs - https://phabricator.wikimedia.org/T191467#4106641 (herron) Hosts have been placed into 2 days downtime and shut down. Will proceed with deletion tomorrow
[20:45:09] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4106658 (awight) I discovered in beta cluster testing that we'll have to...
[20:48:28] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External), Scoring-platform-team (Current), Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4106708 (awight)
[21:08:51] (PS1) Ayounsi: Prometheus: aggregates netstat_Icmp_In and InEcho by cluster [puppet] - https://gerrit.wikimedia.org/r/424139
[21:15:19] (CR) Krinkle: [C: -1] NavigationTiming: Move logic to the server side (1 comment) [puppet] - https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425) (owner: Imarlier)
[21:26:51] Operations, Traffic, netops, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4106852 (ayounsi)
[21:27:56] (CR) Dereckson: [C: 1] Add namespace for euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/424065 (https://phabricator.wikimedia.org/T191396) (owner: Jon Harald Søby)
[21:42:24] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#4106880 (RobH)
[21:43:15] (PS1) Jforrester: For wikis with consolidated feedback, send 2017WTE notes to a better page [mediawiki-config] - https://gerrit.wikimedia.org/r/424147 (https://phabricator.wikimedia.org/T157953)
[21:48:58] (PS4) Paladox: Gerrit: Switch gc back on [puppet] - https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045)
[21:49:25] (PS4) Imarlier: NavigationTiming: Move logic to the server side [puppet] - https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425)
[21:49:49] (CR) Imarlier: NavigationTiming: Move logic to the server side (1 comment) [puppet] - https://gerrit.wikimedia.org/r/423959 (https://phabricator.wikimedia.org/T181425) (owner: Imarlier)
[21:51:39] (PS1) RobH: decom ocg100[1-3] [puppet] - https://gerrit.wikimedia.org/r/424148
(https://phabricator.wikimedia.org/T177958)
[21:52:02] Operations, Patch-For-Review, Scoring-platform-team (Current): Remove deprecated hosts from ORES scap config - https://phabricator.wikimedia.org/T191321#4101428 (awight) a:awight
[21:52:06] (CR) RobH: [C: 2] decom ocg100[1-3] [puppet] - https://gerrit.wikimedia.org/r/424148 (https://phabricator.wikimedia.org/T177958) (owner: RobH)
[21:53:12] !log Wiki replicas: ran `sudo maintain-views --table page_assessments --database trwiki` on all 3 servers for T191455
[21:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:18] T191455: trwiki_p.page_assessments missing on replicas - https://phabricator.wikimedia.org/T191455
[21:54:26] (PS1) RobH: decom ocg100[1-3] prod dns entries [dns] - https://gerrit.wikimedia.org/r/424149 (https://phabricator.wikimedia.org/T177958)
[21:54:58] (CR) RobH: [C: 2] decom ocg100[1-3] prod dns entries [dns] - https://gerrit.wikimedia.org/r/424149 (https://phabricator.wikimedia.org/T177958) (owner: RobH)
[21:57:52] jouncebot: next
[21:57:52] In 1 hour(s) and 2 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T2300)
[21:58:12] <3 jouncebot
[21:58:13] (PS2) Dzahn: Revert "Revert "switch deployment server from tin to deploy1001"" [puppet] - https://gerrit.wikimedia.org/r/422632
[21:58:42] (PS1) Ayounsi: Puppet: add ping_offload role and profile [puppet] - https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090)
[22:01:56] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/10821/" [puppet] - https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[22:03:00] Operations, Traffic, netops, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4106942 (ayounsi)
[22:17:05] Operations, Beta-Cluster-Infrastructure,
HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#4106961 (EddieGP)
[22:18:59] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-prep down hosts - fix/remove? - https://phabricator.wikimedia.org/T191293#4100460 (EddieGP) @brion commented on T174477. We can go ahead a delete deployment-tmh01. Still need confirmation for deployment-videoscaler01.
[22:19:33] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#4106964 (RobH)
[22:19:49] Operations, ops-eqiad, hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#3676074 (RobH) a:RobH>Cmjohnson
[22:21:17] Operations, Traffic, netops, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4106972 (ayounsi) About kernel tuning, here are the variables we can adjust as necessary, with their default. ``` 50 -- /proc/sys/net/ipv4/icmp_msgs_burst 1000 -- /proc/sys...
[22:23:29] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-redis0[12] - https://phabricator.wikimedia.org/T191163#4106974 (EddieGP)
[22:23:34] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4106977 (EddieGP)
[22:25:25] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4106983 (EddieGP)
[22:25:29] Operations, Beta-Cluster-Infrastructure, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#4106982 (EddieGP)
[22:25:32] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3722645 (EddieGP)
[22:26:47] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1004.eqiad.wmnet are marked down but pooled
[22:27:02] PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host
[22:28:01] RECOVERY - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.006 second response time
[22:28:02] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[22:28:10] what's up with mathoid?
[22:29:02] seems kubernetes wanted to go down but wasnt allowed?
[22:29:08] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1004.eqiad.wmnet are marked down but pooled
[22:29:15] so yeah... it tried to fail out but wasnt allowed
[22:29:31] oh, sorry, misparsed that not quite what i said.
[22:29:37] there are a bunch of icinga criticals for kubernetes100[1-4]
[22:29:39] CRITICAL - 'scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job="k8s-node", instance="kubernetes1001.eqiad.wmnet"}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job="k8s-node", instance="kubernetes1001.eqiad.wmnet"}[5m])))': 37027.83480825958 >= 15000.0
[22:29:58] Operations, Beta-Cluster-Infrastructure, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2589035 (EddieGP) There's 4 trusty instances left in deployment-prep: - deployment-tmh01 is to be deleted per T174477/T191293 - deployment-redis0[12] are to be replaced by...
[22:30:19] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 69195.93514036783 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:30:34] got paged as well
[22:30:49] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 45643.16223908918 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:31:10] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 35439.38850346878 = 15000.0
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:31:19] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:35:49] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:36:19] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:36:20] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:41:19] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[22:42:48] I am around
[22:43:07] akosiaris: hey
[22:43:21] this is as far as I got with the investigation https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1&panelId=23&fullscreen :)
[22:44:06] what on earth happened ...
[22:44:43] hmm, there's an increase in traffic
[22:45:57] akosiaris: is mathoid the only service running on k8s?
[22:46:02] yes
[22:47:35] ah there we go https://grafana.wikimedia.org/dashboard/db/service-mathoid?panelId=7&fullscreen&orgId=1&from=1522880524625&to=1522881602825
[22:47:47] increased number of requests
[22:48:12] I guess the fix is to just increase the number of "workers". Lemme triple the initial one
[22:48:23] actually quadruple it
[22:49:01] 10 req/s seem... doable?
[22:49:18] it's only 4 pods currently
[22:51:11] logs have nothing really apart from TeX parse error: Double subscripts: use braces to clarify
[22:51:29] but that just means some expression is wrong
[22:51:57] I did not get the page btw, it's almost 02:00 am over here
[22:52:38] I was browsing icinga to see if it still flapping
[22:52:53] 2018-04-04 22:52:26 +0000 UTC 2018-03-26 18:16:56 +0000 UTC 58 mathoid-gangly-mongoose-261263940-43d3x Pod spec.containers{mathoid-gangly-mongoose} Warning Unhealthy kubelet, kubernetes1003.eqiad.wmnet Readiness probe failed: Get http://10.64.64.174:10044/_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[22:52:54] but yeah I got paged thanks to dst
[22:53:15] PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.20 and port 10042: Connection refused
[22:53:17] hmm it still is having issues, I 'll increase the number of pods right now, no point in stalling
[22:53:25] another page, yep
[22:54:24] !log increase the number of mathoid pods to 16 from 4
[22:54:28] connection refused eh
[22:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:39] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet])
[22:54:47] yeah most pods have multiple restarts
[22:54:50] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1001.eqiad.wmnet,
kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet]) [22:55:09] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled [22:55:19] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled [22:55:39] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 134543.9641577061 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:56:29] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 159562.94905660374 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [22:57:07] I'll check pybal logs too [22:59:29] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:59:35] this is weird. 
I've already increased the available workers; it should have already recovered [22:59:45] godog, akosiaris: lvs1006 says all of mathoid is down [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180404T2300). [23:00:04] James_F, RoanKattouw, and Jhs: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:21] indeed, timeout while fetching looks like [23:00:28] Apr 4 22:59:49 lvs1006 pybal[15265]: [mathoid_10042 ProxyFetch] WARN: kubernetes1002.eqiad.wmnet (enabled/down/pooled): Fetch failed, 32.768 s [23:00:49] I'll SWAT [23:00:59] i'm here :) [23:01:09] akosiaris: would it be doable to go back to scb100[1-4]? [23:01:12] Excellent, then yours can go first [23:01:16] ema: yes [23:01:22] I am thinking about it now [23:01:50] (03CR) 10Catrope: [C: 032] Add namespace for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424065 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [23:01:56] <_joe_> akosiaris: you have the wrong service IP [23:02:06] RoanKattouw, just to give some info: we did this patch earlier today, and it gave some weird errors. Loop error or something, so we should check on mwdebug1002 just to be sure [23:02:16] <_joe_> KUBECONFIG=/etc/kubernetes/admin-eqiad.config kubectl -n mathoid get services [23:02:29] RoanKattouw, but my guess is that actually had to do with some other merges that were done around the same time.
but still, we should be cautious [23:02:47] <_joe_> unless something is just very not clear to me [23:03:04] that is weird [23:03:08] (03Merged) 10jenkins-bot: Add namespace for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424065 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [23:03:21] OK let's check then [23:03:23] <_joe_> it has 10.2.1.20 which is codfw [23:03:26] (03CR) 10jenkins-bot: Add namespace for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424065 (https://phabricator.wikimedia.org/T191396) (owner: 10Jon Harald Søby) [23:03:47] <_joe_> the fastest way to solve the problem is to switch mathoid to codfw via discovery, btw [23:04:01] no, give me a sec [23:04:45] Jhs: OK it's on mwdebug1002 now [23:04:52] ok, checking [23:05:15] <_joe_> akosiaris: still not responding AFAICS [23:05:45] yeah it's not the service IP at fault here, it's the policy [23:05:57] the network policy... for some reason there is none... [23:06:09] RoanKattouw, looks good now, yeah. 
nothing like the errors before at all [23:06:43] OK, deploying then [23:06:44] let me try this in another way [23:08:50] (03PS2) 10Catrope: For wikis with consolidated feedback, send 2017WTE notes to a better page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424147 (https://phabricator.wikimedia.org/T157953) (owner: 10Jforrester) [23:08:57] (03CR) 10Catrope: [C: 032] For wikis with consolidated feedback, send 2017WTE notes to a better page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424147 (https://phabricator.wikimedia.org/T157953) (owner: 10Jforrester) [23:09:11] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add Txikipedia namespace on euwiki (T191396) (duration: 01m 18s) [23:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:17] T191396: Add extra namespace in Basque Wikipedia - https://phabricator.wikimedia.org/T191396 [23:09:20] Jhs: Alright, deployed [23:09:24] i originally wanted to switch the deployment server today but i stopped that, didn't seem a good idea to add more moving parts before that deploy and with the other issues [23:09:35] RoanKattouw, great, thanks! [23:09:41] RECOVERY - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.005 second response time [23:09:57] ema: _joe_ godog: ok fixed [23:10:08] (03Merged) 10jenkins-bot: For wikis with consolidated feedback, send 2017WTE notes to a better page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424147 (https://phabricator.wikimedia.org/T157953) (owner: 10Jforrester) [23:10:09] RoanKattouw, remember namespaceDupes.php [23:10:15] Ah yes will run that now [23:10:18] <_joe_> what happened, apart from the wrong external ip? 
[23:10:22] akosiaris: \o/ [23:10:23] either I messed up the helm upgrade step or helm messed something up [23:10:27] <_joe_> well, I don't really care right now :) [23:10:37] <_joe_> see you on monday [23:10:44] Done [23:10:44] see you _joe_ [23:10:50] see ya [23:11:10] awesome RoanKattouw! & good night, i'm off to bed :) [23:11:15] I'll go as well, looks like it is fixed! [23:11:29] ok thanks. I 'll stick around a bit more just to make sure everything is ok [23:12:09] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 113988.1171586716 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:12:38] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 102424.66697416976 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [23:12:39] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 116546.3438077634 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:12:45] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Set $wgVisualEditorSourceFeedbackTitle (no-op until later) (T157953) (duration: 01m 16s) [23:12:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:52] T157953: When in source mode, point the feedback form to a page about that rather than the VE feedback page - https://phabricator.wikimedia.org/T157953 [23:13:11] the latencies are normal I think with all the pods I was adding [23:13:17] akosiaris: any difference between kubernetes1001-1002 and 1003-1004? The former are ok according to pybal, the latter are not [23:13:33] no there should not be any diff [23:13:52] oh ok now 1004 is also fine [23:14:08] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:14:09] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:14:30] and 1003 as well [23:14:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [23:14:39] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:14:44] hmm it should not have taken so long though [23:14:48] RECOVERY - kubelet operational 
latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:14:49] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [23:15:19] ok I need to retrace what I did that messed this up [23:15:28] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [23:17:10] ah ok found it [23:17:17] PEBKAC, sorry about that [23:17:31] it should have been way faster to recover from this. My mistake [23:17:54] akosiaris: here is what lvs1006 thought of mathoid today https://phabricator.wikimedia.org/P694 [23:18:15] ? [23:18:32] Authored by akosiaris on May 28 2015, 00:15. ? [23:18:42] ema: wrong paste ? [23:18:47] haha indeed [23:18:50] https://phabricator.wikimedia.org/P6948 [23:19:00] and of all the people it had to be by me ? 
[23:19:38] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [23:19:57] (03CR) 10jenkins-bot: For wikis with consolidated feedback, send 2017WTE notes to a better page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424147 (https://phabricator.wikimedia.org/T157953) (owner: 10Jforrester) [23:21:03] akosiaris: so yeah things started looking problematic around 22:24 [23:21:07] ema: ok the 06:23 (logrotate) and 02:21 hours got me a bit worried but it looks like those were transient and recovered very quickly [23:21:19] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4107106 (10Springle) [23:21:47] but the 22:24 correlates nicely with the spike in https://grafana.wikimedia.org/dashboard/db/service-mathoid?panelId=7&fullscreen&orgId=1&from=now-3h&to=now [23:22:09] (03CR) 10Chad: [C: 031] Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [23:22:17] yep [23:22:36] there's been another spike previously (2018-04-04 11:43) https://grafana.wikimedia.org/dashboard/db/service-mathoid?panelId=7&fullscreen&orgId=1&from=1522840802672&to=1522843510558 [23:22:57] yeah multiple ones... just not large enough to trigger alerts? [23:23:35] anyway we can easily increase the number of workers (pods) [23:24:48] and we are starting to learn something from running services on this infrastructure (if only it wasn't at this hour) [23:25:22] anyway, I'll draft an incident report tomorrow [23:25:42] ok, anything left to do now? [23:26:24] no I don't think so [23:26:35] off to bed [23:26:41] alright, good night then!
[23:32:34] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4107141 (10RobH) [23:32:45] 10Operations, 10Ops-Access-Requests: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4107106 (10RobH) p:05Triage>03Normal [23:33:59] RoanKattouw: Did you sync out T191103's fix without mentioning it? [23:34:00] T191103: Dragging a template to the top of the page shouldn't remove it - https://phabricator.wikimedia.org/T191103 [23:34:11] No [23:34:16] Only to mwdebug [23:34:19] Right. [23:34:22] No wait not even that [23:35:12] Ha. [23:36:45] !log catrope@tin Synchronized php-1.31.0-wmf.28/resources/src/mediawiki.rcfilters/: Fix missing bookmark icon (T191366) (duration: 01m 16s) [23:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:51] T191366: [regression - wmf.28] The bookmark icon for Saved filters is missing - https://phabricator.wikimedia.org/T191366 [23:37:40] (03PS1) 10BryanDavis: wiki replicas: drop views with missing tables [puppet] - 10https://gerrit.wikimedia.org/r/424166 (https://phabricator.wikimedia.org/T191387) [23:39:59] Confirmed broken in prod and fixed on mwdebug [23:40:30] RoanKattouw: Yay.
[23:42:25] !log catrope@tin Synchronized php-1.31.0-wmf.27/extensions/VisualEditor/lib/ve: Fix VE drag-and-drop bugs (T191103) (duration: 01m 17s) [23:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:32] T191103: Dragging a template to the top of the page shouldn't remove it - https://phabricator.wikimedia.org/T191103 [23:46:44] (03PS2) 10Dzahn: misc_static_sites: temp disable bromine backend for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) [23:48:44] (03PS1) 10BryanDavis: wiki replicas: Remove localisation and localisation_file_hash views [puppet] - 10https://gerrit.wikimedia.org/r/424168 (https://phabricator.wikimedia.org/T119811) [23:50:13] !log andrew@tin Started deploy [horizon/deploy@2c55bd5]: (no justification provided) [23:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:22] !log andrew@tin Finished deploy [horizon/deploy@2c55bd5]: (no justification provided) (duration: 03m 10s) [23:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:10] (03PS1) 10Pnorman: Fix assorted bugs in process-osm-data script for new schema [puppet] - 10https://gerrit.wikimedia.org/r/424170 (https://phabricator.wikimedia.org/T191345) [23:59:08] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 3 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4107259 (10DarTar) @Ottomata if you can review the privs in the changeset and confirm they are good, that'd be awesome.