[00:07:51] (CR) 20after4: [C: +1] aphlict: add scap deploy target and missing parameters [puppet] - https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:14:49] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:44] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:16:27] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[00:19:13] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:11] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[00:30:22] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[00:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:11] Operations, Cloud-Services, Developer-Advocacy, LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (bd808)
[00:38:44] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1010.eqiad.wmnet'] `...
[00:42:39] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:32] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:44:02] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:40] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:46] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:44:51] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:55] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:41] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:45:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:47] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:51] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:46:00] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:46:03] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission
[00:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:43] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:46:44] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[00:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:50] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:55] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for host...
[00:48:23] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[00:51:30] Operations, SRE-Access-Requests, Wikidata, Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (RKemper)
[01:02:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:21] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:21] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[01:02:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:02:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:02:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:02:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:30] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1005.eqiad.wmnet', 'c...
[01:15:17] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[01:29:57] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'cloudcephosd1004' - https://phabricator.wikimedia.org/T258764 (Andrew)
[01:59:43] (CR) Cwhite: [C: +1] profile: move thanos-query clients to https [puppet] - https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: Filippo Giunchedi)
[02:00:54] (CR) Cwhite: [C: +1] smokeping: default to active/active [puppet] - https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: Filippo Giunchedi)
[02:02:05] (CR) Cwhite: [C: +1] profile: add alert on no logs ingested [puppet] - https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) (owner: Filippo Giunchedi)
[02:06:37] (PS4) Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[02:06:39] (PS1) Andrew Bogott: Add icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615906 (https://phabricator.wikimedia.org/T258764)
[02:07:12] (CR) jerkins-bot: [V: -1] Add icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615906 (https://phabricator.wikimedia.org/T258764) (owner: Andrew Bogott)
[02:08:39] (PS2) Andrew Bogott: Add icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615906 (https://phabricator.wikimedia.org/T258764)
[02:08:41] (PS5) Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[02:09:10] (CR) Andrew Bogott: [C: +2] Add icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615906 (https://phabricator.wikimedia.org/T258764) (owner: Andrew Bogott)
[02:19:50] (PS6) Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[02:19:52] (PS1) Andrew Bogott: 2nd attempt at icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615907 (https://phabricator.wikimedia.org/T258764)
[02:20:24] (CR) jerkins-bot: [V: -1] 2nd attempt at icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615907 (https://phabricator.wikimedia.org/T258764) (owner: Andrew Bogott)
[02:21:46] (PS2) Andrew Bogott: Second try at icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615907 (https://phabricator.wikimedia.org/T258764)
[02:21:48] (PS7) Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[02:22:28] (CR) Andrew Bogott: [C: +2] Second try at icinga descriptions for cloudsw1-d5-eqiad and cloudsw1-c8-eqiad [puppet] - https://gerrit.wikimedia.org/r/615907 (https://phabricator.wikimedia.org/T258764) (owner: Andrew Bogott)
[02:36:40] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/includes/Score.php: removing superseded local patch for hard-coding lilypond version (duration: 01m 09s)
[02:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:45] Puppet, MobileFrontend (Tracking), Readers-Web-Backlog (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Jdlrobson)
[03:26:56] (PS1) Tim Starling: Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/615861
[03:27:23] (PS1) Tim Starling: Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615862
[05:01:15] (PS1) Marostegui: db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/615926 (https://phabricator.wikimedia.org/T257540)
[05:01:59] (CR) Marostegui: [C: +2] db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/615926 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui)
[05:08:02] (CR) Marostegui: [C: +1] "Thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/615823 (https://phabricator.wikimedia.org/T257987) (owner: Bstorm)
[05:14:47] (CR) Marostegui: [C: +1] "Reminder: requires manual removal from the DB itself too" [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: Jcrespo)
[05:31:23] (CR) Marostegui: [C: +1] mariadb::monitor::prometheus: Remove unused parameters [puppet] - https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: Kormat)
[06:07:46] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) It will probably be re-enabled in safe mode early next week. Hopefully Monday my time (i.e...
[06:08:57] (CR) Tim Starling: [C: +2] Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/615861 (owner: Tim Starling)
[06:09:07] (CR) Tim Starling: [C: +2] Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615862 (owner: Tim Starling)
[06:23:11] (PS4) ZPapierski: Migrate wcqs to wcqs-beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/615810
[06:25:45] (Merged) jenkins-bot: Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/615861 (owner: Tim Starling)
[06:25:59] (Merged) jenkins-bot: Run Ghostscript from MediaWiki instead of having LilyPond do it [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615862 (owner: Tim Starling)
[06:30:37] !log tstarling@deploy1001 Started scap: for Score
[06:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:28] (CR) ZPapierski: "NOOP for production servers: https://puppet-compiler.wmflabs.org/compiler1003/24118/" [puppet] - https://gerrit.wikimedia.org/r/615810 (owner: ZPapierski)
[06:39:55] (CR) Elukey: [C: +1] Modernise Apache config [puppet] - https://gerrit.wikimedia.org/r/615459 (owner: Muehlenhoff)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200724T0700)
[07:00:11] (CR) JMeybohm: [C: -1] GC: fix reported counter (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/615788 (owner: Volans)
[07:01:59] (CR) Muehlenhoff: "Stretch doesn't have lilypond and it was implicitly pulled in from stretch-backports. stretch-backports will go away soon as stretch is no" [puppet] - https://gerrit.wikimedia.org/r/615851 (owner: Tim Starling)
[07:02:00] (CR) Filippo Giunchedi: [C: +2] smokeping: default to active/active [puppet] - https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: Filippo Giunchedi)
[07:03:08] (CR) Ayounsi: WMCS Ceph: add address entries for new OSD nodes (4 comments) [puppet] - https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) (owner: Andrew Bogott)
[07:06:35] Operations, observability, Patch-For-Review: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (fgiunchedi) Open→Resolved a:fgiunchedi This is done! Now both netmon2001 and netmon1002 smokeping daemons are running all th...
[07:06:37] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[07:12:02] (CR) Filippo Giunchedi: "IIRC we generally don't see this alert firing for weblog (and centrallog as of next week) so I'm tempted to say that it isn't the typical " [puppet] - https://gerrit.wikimedia.org/r/615826 (owner: Dzahn)
[07:12:50] Operations, Analytics: Move Superset to a Buster VM - https://phabricator.wikimedia.org/T258768 (MoritzMuehlenhoff)
[07:13:46] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[07:14:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:15:15] (PS1) Muehlenhoff: Add DNS record for an-tool1009 [dns] - https://gerrit.wikimedia.org/r/616011 (https://phabricator.wikimedia.org/T258768)
[07:16:25] (PS1) Jcrespo: mariadb: Set db1077 in read-write [puppet] - https://gerrit.wikimedia.org/r/616012 (https://phabricator.wikimedia.org/T257928)
[07:16:59] (PS1) Ayounsi: Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632)
[07:17:25] (CR) jerkins-bot: [V: -1] Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[07:17:27] (CR) Elukey: [C: +1] Add DNS record for an-tool1009 [dns] - https://gerrit.wikimedia.org/r/616011 (https://phabricator.wikimedia.org/T258768) (owner: Muehlenhoff)
[07:17:51] Operations, Analytics, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (elukey)
[07:17:57] (CR) Alexandros Kosiaris: [C: -1] "Nice work! This is looking pretty good already. A number of comments inline" (13 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[07:18:31] (CR) Alexandros Kosiaris: [C: +1] mariadb: Set db1077 in read-write [puppet] - https://gerrit.wikimedia.org/r/616012 (https://phabricator.wikimedia.org/T257928) (owner: Jcrespo)
[07:18:49] (PS2) Ayounsi: Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632)
[07:19:11] (CR) jerkins-bot: [V: -1] Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[07:20:58] (PS3) Ayounsi: Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632)
[07:21:05] (CR) Jcrespo: [C: +2] mariadb: Set db1077 in read-write [puppet] - https://gerrit.wikimedia.org/r/616012 (https://phabricator.wikimedia.org/T257928) (owner: Jcrespo)
[07:21:14] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/24119/icinga1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[07:21:42] (CR) Ayounsi: [C: +2] Add cloudsw1 switches to Icinga [puppet] - https://gerrit.wikimedia.org/r/616013 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[07:24:34] (PS1) Ayounsi: cloudsw1: fix typo esams -> eqiad [puppet] - https://gerrit.wikimedia.org/r/616014 (https://phabricator.wikimedia.org/T251632)
[07:24:38] (CR) Muehlenhoff: [C: +2] Add DNS record for an-tool1009 [dns] - https://gerrit.wikimedia.org/r/616011 (https://phabricator.wikimedia.org/T258768) (owner: Muehlenhoff)
[07:25:03] (CR) Ayounsi: [C: +2] cloudsw1: fix typo esams -> eqiad [puppet] - https://gerrit.wikimedia.org/r/616014 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[07:27:39] (CR) Jcrespo: [C: +1] "As long as the monitoring is up making sure that the content is not public, no issue." [puppet] - https://gerrit.wikimedia.org/r/615459 (owner: Muehlenhoff)
[07:30:17] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'cloudcephosd1004' - https://phabricator.wikimedia.org/T258764 (ayounsi) Open→Resolved Indeed, fixed, they now show up: https://ici...
[07:30:20] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ayounsi)
[07:30:35] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'cloudcephosd1004' - https://phabricator.wikimedia.org/T258764 (ayounsi) a:Cmjohnson→ayounsi
[07:31:46] Operations, vm-requests: eqiad: New Ganeti instance for Hue (an-tool1009) - https://phabricator.wikimedia.org/T258771 (MoritzMuehlenhoff)
[07:34:51] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[07:36:38] Operations, vm-requests: eqiad: New Ganeti instance for Hue (an-tool1009) - https://phabricator.wikimedia.org/T258771 (elukey) Maybe 2 vcores would be better, but we can always expand later on if needed. Thanks!
[07:39:49] PROBLEM - Disk space on wtp1025 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=78%): /tmp 0 MB (0% inode=78%): /var/tmp 0 MB (0% inode=78%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wtp1025&var-datasource=eqiad+prometheus/ops
[07:41:42] checking
[07:42:18] (Abandoned) Jcrespo: mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: Jcrespo)
[07:44:07] !log depool wtp1025 - disk full
[07:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:51] (PS1) DCausse: Revert "Revert "[wdqs] add a new streaming updater profile"" [puppet] - https://gerrit.wikimedia.org/r/616027
[07:45:35] Operations, SRE-Access-Requests, Wikidata, Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (Joe) a:Joe
[07:45:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:46:55] the exceptions were also due to wtp1025 --^
[07:50:39] (PS19) Alexandros Kosiaris: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[07:51:44] (CR) jerkins-bot: [V: -1] charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[07:57:05] (PS3) Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - https://gerrit.wikimedia.org/r/613642
[07:57:28] (CR) Ayounsi: "Thanks for the quick review, comments addressed." (5 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/613642 (owner: Ayounsi)
[07:58:51] (PS20) Alexandros Kosiaris: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[08:00:02] (CR) jerkins-bot: [V: -1] charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[08:11:25] Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (ayounsi) Back to `DISK CRITICAL - free space: /srv 2946 MB (1% inode=99%):` ACKing the alert.
[08:11:50] ACKNOWLEDGEMENT - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 2946 MB (1% inode=99%): ayounsi https://phabricator.wikimedia.org/T257931 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[08:11:57] (PS21) Alexandros Kosiaris: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[08:12:53] (CR) jerkins-bot: [V: -1] charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[08:13:47] XioNoX: there is no "back to" the warning was from a different host
[08:14:35] ah right :)
[08:14:53] end result is the same, alert acked
[08:16:41] (PS1) Kormat: realm: Add oauth_ratelimit_client_tier to private_tables. [puppet] - https://gerrit.wikimedia.org/r/616018 (https://phabricator.wikimedia.org/T258711)
[08:16:43] PROBLEM - Check systemd state on wtp1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:00] Operations, serviceops: wtp1025's root partition full - https://phabricator.wikimedia.org/T258775 (elukey)
[08:18:05] PROBLEM - parsoid on wtp1025 is CRITICAL: connect to address 10.64.0.239 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[08:18:35] RECOVERY - Check systemd state on wtp1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:57] RECOVERY - parsoid on wtp1025 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[08:20:07] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:20:41] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:55] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:21:51] RECOVERY - Disk space on wtp1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wtp1025&var-datasource=eqiad+prometheus/ops
[08:23:07] (CR) Marostegui: [C: +1] "This requires MYSQL restart on db1124, db1125, db2094 and db2095. I would also like to keep the task open so they can notify when the tabl" [puppet] - https://gerrit.wikimedia.org/r/616018 (https://phabricator.wikimedia.org/T258711) (owner: Kormat)
[08:23:09] Operations, Discovery-Search, Elasticsearch: Reindex commonswiki as shards have grown beyond critical threshold - https://phabricator.wikimedia.org/T231446 (Gehel) Open→Resolved a:Gehel
[08:24:07] (CR) Kormat: [C: +2] realm: Add oauth_ratelimit_client_tier to private_tables. [puppet] - https://gerrit.wikimedia.org/r/616018 (https://phabricator.wikimedia.org/T258711) (owner: Kormat)
[08:24:33] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Gehel) Open→Resolved
[08:25:49] PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:26:21] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:28:05] Operations, Discovery-Search, Elasticsearch: Add more metrics to upstream's elasticsearch exporter. - https://phabricator.wikimedia.org/T214547 (Gehel) Open→Declined This has not been a need recently. Let's reopen if we actually need it in the future.
[08:28:08] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Gehel)
[08:28:43] PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[08:28:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 53 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:29:01] Operations, Discovery-Search, Epic, Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (Gehel) Open→Resolved a:Gehel Cookbooks are now available for all major operations on elasticsearch. There are still improvements to b...
[08:29:03] Operations, SRE-tools, User-Joe, User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (Gehel)
[08:30:47] Operations, Discovery, Discovery-Search, Elasticsearch: Make elasticsearch configuration more robust to loss of network connectivity - https://phabricator.wikimedia.org/T143552 (Gehel) Open→Declined This has not been an issue recently. The current configuration does raise a few alerts in...
[08:31:42] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: show recovery status/stats in es-tool - https://phabricator.wikimedia.org/T104022 (10Gehel) 05Open→03Declined es-tool has been replaced by spicerack cookbooks [08:33:50] Jul 24 08:15:33 kubernetes2002 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT [08:33:50] Jul 24 08:15:33 kubernetes2002 systemd[1]: nagios-nrpe-server.service: Failed to fork: Resource temporarily unavailable [08:33:57] interesting [08:34:04] where was that task [08:34:34] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) [08:35:24] !log start nagios-nrpe-server on kubernetes2002 [08:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:37] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:48] (03PS2) 10Peter.ovchyn: Add defaults for initial state for sidebar. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/610069 (https://phabricator.wikimedia.org/T254230) [08:36:36] (03PS5) 10Peter.ovchyn: Remove WPBSkinBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) [08:36:53] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [08:37:11] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:38:33] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:02] !log restarting mariadb on all sanitarium hosts T258711 [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:08] T258711: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 [08:40:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:42:00] kormat: the hosts are on icinga, did you downtime them? [08:42:07] maybe downtime wasn't processed? [08:42:14] i.. crap. i did not. [08:45:13] (03CR) 1020after4: [V: 03+2 C: 03+2] Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [08:46:12] phew, ok. managed to downtime them all before anything alerted. 
😓 [08:49:37] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops [08:56:41] RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [09:02:16] (03PS5) 10Kormat: mariadb::monitor::prometheus: Remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) [09:02:55] (03PS9) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [09:03:55] (03CR) 10Jbond: [C: 04-1] admins: let wdqs-admins view nginx logs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [09:05:31] (03PS1) 10Ema: Go mod configuration [software/atskafka] - 10https://gerrit.wikimedia.org/r/616021 [09:05:33] (03PS1) 10Ema: Use testify for testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/616022 [09:05:35] (03PS1) 10Ema: Move to dh-golang [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 [09:05:54] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:06:20] (03CR) 10jerkins-bot: [V: 04-1] Go mod configuration [software/atskafka] - 10https://gerrit.wikimedia.org/r/616021 (owner: 10Ema) [09:06:22] (03CR) 10jerkins-bot: [V: 04-1] Use testify for testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/616022 (owner: 10Ema) [09:06:31] (03Merged) 10jenkins-bot: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [09:09:53] (03CR) 
10Jcrespo: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [09:10:49] (03PS22) 10Alexandros Kosiaris: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [09:10:50] (03PS1) 10Alexandros Kosiaris: Rakefile: Correctly match start of YAML docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/616024 [09:11:37] (03CR) 10Jcrespo: "I am going to test this by having a host with a free port and another with it pre-reserved." [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:16:00] (03PS1) 10Elukey: druid: add metrics for version 0.19 and update Druid test's config [puppet] - 10https://gerrit.wikimedia.org/r/616025 (https://phabricator.wikimedia.org/T244482) [09:17:17] (03CR) 10jerkins-bot: [V: 04-1] druid: add metrics for version 0.19 and update Druid test's config [puppet] - 10https://gerrit.wikimedia.org/r/616025 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [09:17:25] uuffff [09:19:21] (03PS2) 10Elukey: druid: add metrics for version 0.19 and update Druid test's config [puppet] - 10https://gerrit.wikimedia.org/r/616025 (https://phabricator.wikimedia.org/T244482) [09:20:03] (03CR) 10Jcrespo: "Some questions..." 
(034 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [09:23:29] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [09:26:16] (03PS1) 10Ema: Release version 0.10 [software/atskafka] - 10https://gerrit.wikimedia.org/r/616046 [09:26:21] (03CR) 10Elukey: [C: 03+2] druid: add metrics for version 0.19 and update Druid test's config [puppet] - 10https://gerrit.wikimedia.org/r/616025 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [09:30:01] (03PS2) 10Ema: Move to dh-golang [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 [09:30:02] (03PS2) 10Ema: Go mod configuration [software/atskafka] - 10https://gerrit.wikimedia.org/r/616021 [09:30:04] (03PS2) 10Ema: Use testify for testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/616022 [09:30:08] (03PS2) 10Ema: Release version 0.10 [software/atskafka] - 10https://gerrit.wikimedia.org/r/616046 [09:31:43] (03PS1) 10Jbond: admin: add check to detect vi, vim and view in sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/616048 [09:31:44] (03PS1) 10Jbond: admin: add bad sudo commands to test CI [puppet] - 10https://gerrit.wikimedia.org/r/616049 [09:32:18] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:32:42] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:32:45] (03CR) 10Jcrespo: [C: 03+2] "I was able to transfer to 2 hosts with different detected port after applying this." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:32:49] (03CR) 10jerkins-bot: [V: 04-1] admin: add bad sudo commands to test CI [puppet] - 10https://gerrit.wikimedia.org/r/616049 (owner: 10Jbond) [09:33:09] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:33:45] (03PS4) 10Jcrespo: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:34:08] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:35:00] (03PS2) 10Jbond: admin: add check to detect vi, vim and view in sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/616048 [09:35:06] (03CR) 10Jcrespo: "Unit tests would need updating because of the parameter changes." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [09:35:14] (03PS2) 10Jbond: admin: add bad sudo commands to test CI [puppet] - 10https://gerrit.wikimedia.org/r/616049 [09:36:22] (03CR) 10jerkins-bot: [V: 04-1] admin: add bad sudo commands to test CI [puppet] - 10https://gerrit.wikimedia.org/r/616049 (owner: 10Jbond) [09:36:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one comment inline" (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 (owner: 10Ema) [09:36:52] (03CR) 10Jbond: "ready for review, See next PS in relation chain for test cases" [puppet] - 10https://gerrit.wikimedia.org/r/616048 (owner: 10Jbond) [09:37:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:39:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:45:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [09:48:20] addshore: hi there :) you have the dubious distinction as being listed as a "Gerrit Manager", so i'm wondering if you can help me [09:48:46] i've created a branch on operations/puppet (`sandbox/kormat/pontoon-mariadb104-test`), but it appears i do not have rights to force-push to it [09:48:47] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [09:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:05] addshore: do you know if it's possible to get said rights, or who i'd need to bother about it? [09:49:43] It's definitely possible to get that right, but I don't think I'd be able to do it, but let me find the group of people that would be able to / where to look [09:50:49] (03CR) 10Zfilipin: [C: 03+1] "`npm t` and `npm run selenium` pass! https://phabricator.wikimedia.org/P12035" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [09:51:13] kormat: so anyone in ldap/ops can alter the repo settings, and that can be done on https://gerrit.wikimedia.org/r/admin/repos/operations/puppet,access [09:52:04] that has various permissions, such as just push, and then you'd just need the right ref [09:52:35] oh! so i could actually do it myself. thanks <3 [09:53:07] I don't know if "push" will let you force push, but you can give it a go.
Also you need to add a group (not an individual) with the right, so i guess you can reuse ldap/ops :) [09:54:08] addshore: kormat: there is a checkbox "allow force push" [09:55:25] https://usercontent.irccloud-cdn.com/file/5oEN6Lf2/image.png [09:55:40] Urbanecm: ahh, great [09:57:05] kormat: https://gerrit.wikimedia.org/r/admin/repos/All-Projects,access defines "users can push to their own sandbox", if you want inspiration for how to override that [09:57:16] it worked \o/ https://usercontent.irccloud-cdn.com/file/3nTLJwbj/image.png [09:57:25] cool! [09:57:48] force push, what could possibly got wrong! :-DDDD [09:57:52] *ĝo [09:57:55] *go [09:58:03] jynus: everything, but at speed! :) [09:58:12] hopefully not in sandboxes [09:58:17] jk ofc [09:58:36] addshore, Urbanecm: thanks! <3 [09:58:40] hth! [09:58:47] Urbanecm: although wait for all production to be dependent on kormat's sandbox some time soon [09:58:53] xD [09:58:53] :-) [09:59:14] :D [09:59:22] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:10] btw...would it be possible to get my idp.wikimedia.org logins to use MFA? I do have a yubikey, and I love to use it everywhere where I can :) [10:01:52] Urbanecm: sure :-) Please see https://wikitech.wikimedia.org/wiki/CAS-SSO#Requesting_to_enable_U2F_(if_you're_not_in_SRE) [10:02:15] thanks! 
[10:02:41] (03PS5) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) [10:03:33] 10Operations, 10LDAP-Access-Requests: Enable UF2 for Urbanecm's account - https://phabricator.wikimedia.org/T258781 (10Urbanecm) [10:03:46] (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [10:03:58] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [10:04:25] (03Merged) 10jenkins-bot: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [10:12:26] (03CR) 10Jcrespo: "First run of test look successful, only one comment below regarding UI/user friendliness." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm) [10:17:47] (03CR) 10Privacybatm: "I am working on the exception creation now, after that I will make a new patch with all the changes. But please see my comment about lock " (034 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:24:34] (03CR) 10Jcrespo: "Either me or you didn't understood the question 0:-)." 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:25:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:27:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:21] (03PS2) 10Effie Mouzeli: Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) [10:31:33] (03PS1) 10Muehlenhoff: Add an-tool1009 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/616057 (https://phabricator.wikimedia.org/T258768) [10:34:02] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Nonovian) Hi. as per [[ https://fr.wikinews.org/wiki/Wikinews:Prise_de_décision/Nouvelle_donne_pour_Wikinews#Prop... [10:34:09] (03PS1) 10Jbond: admin: drop pentester privileges [puppet] - 10https://gerrit.wikimedia.org/r/616058 [10:35:05] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts: ` restbase-dev1004.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202007241034_hn... 
[10:35:21] !log started reimage of restbase-dev1004 [10:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:06] (03PS2) 10Muehlenhoff: Add an-tool1009 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/616057 (https://phabricator.wikimedia.org/T258768) [10:36:30] (03PS1) 10Daniel Kinzler: Import: use master DB for loading slots. [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616029 (https://phabricator.wikimedia.org/T258666) [10:37:21] (03PS2) 10Daniel Kinzler: Import: use master DB for loading slots. [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616029 (https://phabricator.wikimedia.org/T258666) [10:38:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add certificates and API keys for push-notifications (036 comments) [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) (owner: 10Effie Mouzeli) [10:44:23] (03PS4) 10Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [10:44:31] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:49:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [10:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:53] (03PS5) 10Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:00:00] (03CR) 10Privacybatm: "Yeah, I understood the question now 😄" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 
(https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:02:45] (03CR) 10Privacybatm: "> Patch Set 5:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:05:24] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase-dev1004.eqiad.wmnet'] ` and were **ALL** successful. [11:08:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:10:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:13:42] !log started bootstrap of restbase-dev1004-a [11:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] (03CR) 10Privacybatm: "Happy to hear it is a nice feature :-) Thank you!" 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm) [11:18:10] (03PS5) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [11:19:57] (03PS6) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [11:21:12] (03PS1) 10JMeybohm: New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) [11:22:02] (03CR) 10Ladsgroup: [C: 03+2] "Emergency deployment" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616029 (https://phabricator.wikimedia.org/T258666) (owner: 10Daniel Kinzler) [11:23:12] (03PS2) 10JMeybohm: New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) [11:24:32] (03PS3) 10JMeybohm: New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) [11:29:25] (03PS3) 10Effie Mouzeli: Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) [11:32:05] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [11:32:20] (03CR) 10Ema: Move to dh-golang (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 (owner: 10Ema) [11:32:55] (03PS1) 10Ssingh: wikidough: set TLSv1.2 as the minimum version for DoT [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) [11:33:58] 10Operations, 10ops-eqiad, 
10Platform Team Workboards (Green): Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) [11:34:51] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [11:36:39] (03PS2) 10Ssingh: wikidough: set TLSv1.2 as the minimum version for DoT [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) [11:37:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [11:38:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) (owner: 10Effie Mouzeli) [11:39:45] (03CR) 10Ssingh: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24123/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:40:39] (03Merged) 10jenkins-bot: Import: use master DB for loading slots. [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616029 (https://phabricator.wikimedia.org/T258666) (owner: 10Daniel Kinzler) [11:42:49] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042) (owner: 10Effie Mouzeli) [11:46:41] (03PS1) 10Muehlenhoff: Remove access for drossi/fsalurati [puppet] - 10https://gerrit.wikimedia.org/r/616072 [11:47:33] (03CR) 10jerkins-bot: [V: 04-1] Remove access for drossi/fsalurati [puppet] - 10https://gerrit.wikimedia.org/r/616072 (owner: 10Muehlenhoff) [11:47:49] (03CR) 10Ssingh: "Can you please review the TLS1.2 cipher suites and their order? Thanks! 
(Note that dnsdist sets the order preferred by the server.)" [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:48:31] 10Operations, 10ops-eqiad, 10Platform Team Workboards (Green): Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) 05Open→03Resolved [11:48:47] !log bootstrapped restbase-dev1004-b [11:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:51] (03PS2) 10Effie Mouzeli: hosts: assign puppet role for rdb2007,rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) [11:50:22] TimStarling: Let me know once you're done (I'm deploying for T258666 atm) [11:50:23] T258666: RevisionAccessException when trying to import files with FileImporter - https://phabricator.wikimedia.org/T258666 [11:56:02] (03PS2) 10Muehlenhoff: Remove access for drossi/fsalurati [puppet] - 10https://gerrit.wikimedia.org/r/616072 [12:01:53] (03CR) 10JMeybohm: [C: 03+2] mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [12:02:32] (03Merged) 10jenkins-bot: mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [12:04:15] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [12:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:18] (03CR) 10JMeybohm: helmfile: strawman refactoring (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:14:03] (03CR) 10Hnowlan: "> is this a good idea? shouldn't we just expose the envoy health admin endpoint and monitor that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [12:27:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for drossi/fsalurati [puppet] - 10https://gerrit.wikimedia.org/r/616072 (owner: 10Muehlenhoff) [12:28:52] (03PS1) 10Jbond: admin: drop strace/tcpdump permissions [puppet] - 10https://gerrit.wikimedia.org/r/616076 (https://phabricator.wikimedia.org/T179317) [12:30:18] (03PS1) 10Muehlenhoff: Add component/lilypond [puppet] - 10https://gerrit.wikimedia.org/r/616077 (https://phabricator.wikimedia.org/T256877) [12:31:25] (03CR) 10Vgutierrez: wikidough: set TLSv1.2 as the minimum version for DoT (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:32:07] (03PS2) 10Jbond: admin: drop strace/tcpdump permissions [puppet] - 10https://gerrit.wikimedia.org/r/616076 (https://phabricator.wikimedia.org/T179317) [12:33:51] (03CR) 10Muehlenhoff: [C: 03+2] Add component/lilypond [puppet] - 10https://gerrit.wikimedia.org/r/616077 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [12:34:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff) [12:34:45] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[12:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:55] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [12:47:42] (03CR) 10Ssingh: wikidough: set TLSv1.2 as the minimum version for DoT (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:48:00] (03PS3) 10Ssingh: wikidough: set TLSv1.2 as the minimum version for DoT [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) [12:50:01] (03PS1) 10Elukey: collector.py: import json and add more logging for the user [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/616079 [12:50:32] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/24125/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:51:26] (03PS2) 10Elukey: collector.py: import json and add more logging for the user [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/616079 [12:52:12] (03CR) 10Elukey: [V: 03+2 C: 03+2] collector.py: import json and add more logging for the user [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/616079 (owner: 10Elukey) [12:54:09] (03PS2) 10Muehlenhoff: Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 (owner: 10Tim Starling) [12:58:06] (03CR) 10Elukey: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff) [13:00:05] (03PS1) 10Jbond: admin: varnishncsa/varnishlog run commands as varnish user [puppet] - 10https://gerrit.wikimedia.org/r/616080 (https://phabricator.wikimedia.org/T179317) [13:00:27] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.1/includes/import/ImportableOldRevisionImporter.php: [[gerrit:616029|Import: use master DB for loading slots.]] (T258666) (duration: 01m 07s) [13:00:32] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:33] T258666: RevisionAccessException when trying to import files with FileImporter - https://phabricator.wikimedia.org/T258666 [13:02:39] (03PS1) 10Urbanecm: Increase wgAbuseFilterEmergencyDisable for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) [13:17:19] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:23] (03PS5) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [13:27:32] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10MoritzMuehlenhoff) Fantastic :-) [13:31:21] !log advertise 185.71.138.0/24 from AMS [13:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 (owner: 10Ema) [13:34:34] (03CR) 10CDanis: "Discussed some with Joe as he got nerdsniped despite this being a vacation day for him." [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis) [13:34:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, this group is current not needed and if it's ever needed again, we can define the permissions as needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/616058 (owner: 10Jbond) [13:35:49] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [13:35:54] (03CR) 10Jbond: [C: 03+2] admin: drop pentester privileges [puppet] - 10https://gerrit.wikimedia.org/r/616058 (owner: 10Jbond) [13:36:00] (03PS2) 10Jbond: admin: drop pentester privileges [puppet] - 10https://gerrit.wikimedia.org/r/616058 [13:36:03] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) p:05Triage→03High [13:36:04] (03CR) 10A2093064: [C: 04-1] Increase wgAbuseFilterEmergencyDisable for zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) (owner: 10Urbanecm) [13:37:14] (03CR) 10Vgutierrez: [C: 03+1] Go mod configuration [software/atskafka] - 10https://gerrit.wikimedia.org/r/616021 (owner: 10Ema) [13:37:55] (03CR) 10Muehlenhoff: [C: 03+1] "Needs some rebase for the pentesters change (as already removed previously), +1 on the mediawiki-testers part." 
[puppet] - 10https://gerrit.wikimedia.org/r/616076 (https://phabricator.wikimedia.org/T179317) (owner: 10Jbond) [13:38:58] (03PS3) 10Jbond: admin: drop strace/tcpdump permissions [puppet] - 10https://gerrit.wikimedia.org/r/616076 (https://phabricator.wikimedia.org/T179317) [13:39:42] (03CR) 10Vgutierrez: [C: 03+1] Use testify for testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/616022 (owner: 10Ema) [13:41:06] (03CR) 10Jbond: [C: 03+2] admin: drop strace/tcpdump permissions [puppet] - 10https://gerrit.wikimedia.org/r/616076 (https://phabricator.wikimedia.org/T179317) (owner: 10Jbond) [13:41:25] (03PS1) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [13:41:55] (03PS1) 10Ayounsi: Add 185.71.138.0/24 to bgp_out in esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/616102 [13:41:57] (03PS1) 10Ayounsi: Add 185.71.138.0/24 to wikimedia4 [homer/public] - 10https://gerrit.wikimedia.org/r/616103 [13:42:01] (03PS10) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:42:38] (03CR) 10jerkins-bot: [V: 04-1] Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [13:43:23] (03CR) 10Ayounsi: [C: 03+2] Add 185.71.138.0/24 to bgp_out in esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/616102 (owner: 10Ayounsi) [13:43:33] (03CR) 10Kormat: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24126/" [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [13:43:56] (03Merged) 10jenkins-bot: Add 185.71.138.0/24 to bgp_out in esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/616102 (owner: 10Ayounsi) [13:44:34] (03PS2) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] 
- 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [13:45:18] (03PS11) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:46:41] (03PS6) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [13:46:45] (03CR) 10Vgutierrez: [C: 03+1] wikidough: set TLSv1.2 as the minimum version for DoT [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:48:01] (03CR) 10Jbond: admins: let wdqs-admins run jstack as root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [13:51:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'cloudcephosd1004' - https://phabricator.wikimedia.org/T258764 (10Andrew) Thank you @ayounsi -- looks like i had half of this right but was m... [13:52:14] (03PS3) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [13:53:53] (03PS7) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [13:54:03] hi valentin, i noticed a privilege escalation with varnishncsa and our sudo config, added you to a change to plug it (https://gerrit.wikimedia.org/r/c/operations/puppet/+/616080) [13:54:31] hmmm wrong window? ;P [13:54:42] yes [13:54:51] * vgutierrez checking :D [13:56:20] (03CR) 10Privacybatm: "This change is ready for review."
[software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [13:56:22] (03PS4) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [13:57:23] (03CR) 10Vgutierrez: [C: 03+1] admin: varnishncsa/varnishlog run commands as varnish user [puppet] - 10https://gerrit.wikimedia.org/r/616080 (https://phabricator.wikimedia.org/T179317) (owner: 10Jbond) [13:59:34] (03PS5) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [14:00:23] (03PS2) 10Jbond: admin: varnishncsa/varnishlog run commands as varnish user [puppet] - 10https://gerrit.wikimedia.org/r/616080 (https://phabricator.wikimedia.org/T179317) [14:03:57] (03CR) 10Jbond: [C: 03+2] admin: varnishncsa/varnishlog run commands as varnish user [puppet] - 10https://gerrit.wikimedia.org/r/616080 (https://phabricator.wikimedia.org/T179317) (owner: 10Jbond) [14:06:17] (03PS8) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) [14:07:48] (03CR) 10Muehlenhoff: [C: 03+2] Add an-tool1009 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/616057 (https://phabricator.wikimedia.org/T258768) (owner: 10Muehlenhoff) [14:09:01] (03PS7) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [14:11:09] (03CR) 10JMeybohm: helmfile: strawman refactoring (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:13:59] (03PS1) 10CDanis: Add in-addr.arpa reverse DNS zone for our new 185.71.138.0/24 [dns] - 10https://gerrit.wikimedia.org/r/616106 
[14:16:00] (03CR) 10Ayounsi: [C: 03+1] "LGTM comparing to other zone files." [dns] - 10https://gerrit.wikimedia.org/r/616106 (owner: 10CDanis) [14:17:24] (03CR) 10CDanis: [C: 03+2] "> Patch Set 1: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/616106 (owner: 10CDanis) [14:20:48] (03CR) 10Ssingh: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:27:43] (03PS1) 10ZPapierski: Use correct UriScheme in Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) [14:29:04] (03CR) 10jerkins-bot: [V: 04-1] Use correct UriScheme in Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: 10ZPapierski) [14:30:07] !log reedy@deploy1001 Started scap: Score backports [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:31] (03PS3) 10Jbond: admin: add check to detect vi, vim and view in sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/616048 [14:53:50] (03PS2) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [14:54:01] (03PS2) 10ZPapierski: Use correct UriScheme in Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) [14:54:06] (03CR) 10Hnowlan: "As mentioned I'm a bit nervous about the admin interface, but the check in the latest patchset correctly checks web status on the current " [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [14:54:11] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add namespace for api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/615521 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:55:32] 10Operations, 10TechCom-RFC, 10Traffic, 10MobileFrontend (Tracking), 
10Readers-Web-Backlog (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Jdlrobson) [15:06:58] !log reedy@deploy1001 Finished scap: Score backports (duration: 36m 50s) [15:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:51] (03CR) 10Ema: [C: 03+1] admin: varnishncsa/varnishlog run commands as varnish user [puppet] - 10https://gerrit.wikimedia.org/r/616080 (https://phabricator.wikimedia.org/T179317) (owner: 10Jbond) [15:09:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616048 (owner: 10Jbond) [15:11:47] (03PS6) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [15:12:41] (03CR) 10Jbond: [C: 03+2] admin: add check to detect vi, vim and view in sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/616048 (owner: 10Jbond) [15:13:27] (03Abandoned) 10Jbond: admin: add bad sudo commands to test CI [puppet] - 10https://gerrit.wikimedia.org/r/616049 (owner: 10Jbond) [15:14:34] (03PS2) 10Jbond: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [15:15:28] (03CR) 10jerkins-bot: [V: 04-1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [15:16:20] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10ema) >>! In T195981#6329235, @jbond wrote: >>>! In T195981#5635590, @jbond wrote: >> I attempted a [[ https://github.com/puppetlabs/puppet/pull/7802 | patch for this... 
[15:25:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [15:41:50] (03CR) 10Andrew Bogott: [C: 03+2] WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [15:43:24] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: contint2001.wikimedia.org, cloudcephosd1011.eqiad.wmnet, contint1001.wikimedia.org, testreduce1001.eqiad.wmnet, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:53:19] (03PS1) 10Andrew Bogott: Make cloudcephosd1005.eqiad.wmnet a ceph node [puppet] - 10https://gerrit.wikimedia.org/r/616115 (https://phabricator.wikimedia.org/T251619) [15:53:48] (03PS1) 10Herron: logstash: move normalize log level filter to separate mutate block [puppet] - 10https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) [15:54:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:55:28] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcephosd1005.eqiad.wmnet a ceph node [puppet] - 10https://gerrit.wikimedia.org/r/616115 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [15:58:11] (03PS1) 10Andrew Bogott: Fix naming of cloudcephosd1004 an site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 [15:58:20] (03PS23) 10Hnowlan: api-gateway: Basic envoy chart 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [15:58:47] (03CR) 10jerkins-bot: [V: 04-1] Fix naming of cloudcephosd1004 an site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 (owner: 10Andrew Bogott) [15:59:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:59:26] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:59:46] (03PS2) 10Andrew Bogott: Fix naming of cloudcephosd1004 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 [16:00:11] (03CR) 10jerkins-bot: [V: 04-1] Fix naming of cloudcephosd1004 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 (owner: 10Andrew Bogott) [16:01:17] (03PS3) 10Andrew Bogott: Fix naming of cloudcephosd1004 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 (https://phabricator.wikimedia.org/T251619) [16:02:17] (03CR) 10Andrew Bogott: [C: 03+2] Fix naming of cloudcephosd1004 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/616119 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [16:02:36] (03CR) 10RLazarus: [C: 03+1] appserver hiera: nginx is no more, long live envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/615874 (owner: 10CDanis) [16:17:45] (03PS1) 10Lucas Werkmeister (WMDE): Prevent onTitleGetRestrictionTypes changing ns0 protections [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616032 (https://phabricator.wikimedia.org/T258323) [16:18:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Prevent onTitleGetRestrictionTypes changing ns0 protections [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/616032 (https://phabricator.wikimedia.org/T258323) (owner: 10Lucas Werkmeister (WMDE)) [16:20:45] (03CR) 10Hnowlan: api-gateway: Basic envoy chart (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:21:41] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [16:22:40] I’ll be deploying a wmf.1 backport soon [16:24:16] (correction, Amir1 says he’ll do the deploy) [16:24:28] Lucas_WMDE: ack [16:27:59] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Allow large POST to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760) (owner: 10BryanDavis) [16:28:56] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) (owner: 10BryanDavis) [16:29:25] (03PS1) 10Hnowlan: api-gateway: proxy clusters interface through Envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254906) [16:29:32] (03PS2) 10Andrew Bogott: toolforge: Allow large POST to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760) (owner: 10BryanDavis) [16:31:22] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: proxy clusters interface through Envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:31:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['cloudcephosd1013.eqiad.wmnet', 'c... [16:33:08] (03CR) 10Reedy: [C: 03+1] "WFM :)" [labs/private] - 10https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774) (owner: 10BryanDavis) [16:34:27] (03CR) 10CDanis: [C: 03+2] "I will probably also write a followup patch that renames this sub-key from 'services' to 'systemd_services' because I got volansed by the " [puppet] - 10https://gerrit.wikimedia.org/r/615874 (owner: 10CDanis) [16:35:54] (03PS4) 10Andrew Bogott: Add Cloud VPS global root key for Sam Reed [labs/private] - 10https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774) (owner: 10BryanDavis) [16:36:03] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add Cloud VPS global root key for Sam Reed [labs/private] - 10https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774) (owner: 10BryanDavis) [16:40:11] (03Merged) 10jenkins-bot: Prevent onTitleGetRestrictionTypes changing ns0 protections [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616032 (https://phabricator.wikimedia.org/T258323) (owner: 10Lucas Werkmeister (WMDE)) [16:40:19] (03PS3) 10Dzahn: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) [16:41:35] (03CR) 10RLazarus: "John, thanks for catching that." 
[puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [16:46:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10wiki_willy) [16:47:41] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Wikibase/repo/includes/WikibaseRepo.php: [[gerrit:616032|Prevent onTitleGetRestrictionTypes changing ns0 protections]], Part I (duration: 01m 06s) [16:47:44] (03CR) 10Dzahn: "@dcausse This looks like it works to me, where 46484 is a blazegraph PID:" [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [16:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:30] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Wikibase/repo/includes/RepoHooks.php: [[gerrit:616032|Prevent onTitleGetRestrictionTypes changing ns0 protections]], Part II (duration: 01m 07s) [16:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:25] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn) @dcausse As John pointed out on Gerrit the (only) java process running on wdqs machines is run... 
[16:55:04] (03PS1) 10Herron: logstash-next: change backend naming from kibana-next to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/616124 [16:55:36] !log deployment done [16:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:25] (03PS2) 10Hnowlan: api-gateway: proxy clusters interface through Envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254906) [16:58:52] (03CR) 10Hnowlan: "Tested with `check_http -H api.wikimedia.org -I 127.0.0.1 -p 7000 -u /clusters -r "webserver_cluster.*health_flags::healthy"` after forwar" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:59:03] (03CR) 10Hnowlan: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:59:21] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: proxy clusters interface through Envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:59:23] (03CR) 10Dzahn: "> Maybe it would be best to just handle this with file permissions after all. I'm sensitive to the "central location in the admins module"" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [17:01:35] (03PS1) 10Jdlrobson: Enable desktop click tracking instrumentation on (fr,he,fa)wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616125 (https://phabricator.wikimedia.org/T258058) [17:03:38] (03CR) 10Dzahn: admins: let wdqs-admins view nginx logs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [17:07:57] jbond42: that's the first time i got a full video as a code review comment. very nice:) thanks! 
[17:08:14] (03CR) 10Herron: "My thinking here is that giving the kibana 7 backends a long-lived name would make the 5->7 cutover, potential rollback, and future upgrad" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [17:08:19] (03PS3) 10Jdlrobson: Switch test wikis to new version of vector by default (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) [17:08:21] (03PS3) 10Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) [17:22:53] (03CR) 10Dzahn: [C: 03+1] "since Valentin already reviewed the ciphers list and the puppet part compiles, this looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:28:05] (03CR) 10Ssingh: [C: 03+2] wikidough: set TLSv1.2 as the minimum version for DoT [puppet] - 10https://gerrit.wikimedia.org/r/616067 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:31:15] 10Operations, 10serviceops: wtp1025's root partition full - https://phabricator.wikimedia.org/T258775 (10wkandek) Back to 76%: php-1.36.0-wmf.1 and .41 are now way smaller. ` wkandek@wtp1025:/srv/mediawiki$ df -k Filesystem 1K-blocks Used Available Use% Mounted on udev 32927420 0 32... [17:34:29] (03CR) 10CDanis: "I don't think this is necessary. I only had a long-running tmux there because I left it by accident when debugging something two weeks ag" [puppet] - 10https://gerrit.wikimedia.org/r/615826 (owner: 10Dzahn) [17:36:56] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10dcausse) I could not use this command last time and it did not work I think because the jvm was too b... 
[17:52:33] (03Abandoned) 10Dzahn: do not monitor long-running screens on weblog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/615826 (owner: 10Dzahn) [17:57:13] (03CR) 10Dzahn: "As stated by David on https://phabricator.wikimedia.org/T258739#6333879 he needs jstacks -F which needs root. I can confirm with -F as bla" [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [18:05:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:11:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:16:30] (03PS1) 10Ryan Kemper: Revert "Revert "[wdqs] add a new streaming updater profile"" [puppet] - 10https://gerrit.wikimedia.org/r/616036 [18:16:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:17:44] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "[wdqs] add a new streaming updater profile"" [puppet] - 10https://gerrit.wikimedia.org/r/616036 (owner: 10Ryan Kemper) [18:18:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:27:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:27:17] (03Abandoned) 10Ryan Kemper: Revert "Revert "[wdqs] add a new streaming updater profile"" [puppet] - 10https://gerrit.wikimedia.org/r/616036 (owner: 10Ryan Kemper) [18:27:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:29:00] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:33:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:54:42] (03PS2) 10Dzahn: aphlict: remove requirement of the phab_deploy_finalize script [puppet] - 10https://gerrit.wikimedia.org/r/615879 (https://phabricator.wikimedia.org/T238593) [18:55:53] (03PS3) 10Dzahn: aphlict: remove requirement of the phab_deploy_finalize script [puppet] - 10https://gerrit.wikimedia.org/r/615879 (https://phabricator.wikimedia.org/T238593) [19:00:07] (03CR) 10Ryan Kemper: "Looks good, I'll get approval for the sudoers change in this upcoming monday's SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [19:04:45] (03CR) 10Ryan Kemper: "CC Jbond and moritzm for sudoers approval (entry added to upcoming sre meeting doc as well)" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [19:08:44] (03CR) 10Dzahn: [C: 03+2] aphlict: remove requirement of the phab_deploy_finalize script [puppet] - 10https://gerrit.wikimedia.org/r/615879 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:36:33] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:42:02] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:47:26] (03PS1) 10Andrew Bogott: Add ips in cloud-hosts1-b-eqiad for cloudcephmon nodes [dns] - 10https://gerrit.wikimedia.org/r/616150 (https://phabricator.wikimedia.org/T258826) [19:47:28] (03PS1) 10Andrew Bogott: Remove public IP addresses for cloudcephmons nodes [dns] - 10https://gerrit.wikimedia.org/r/616151 (https://phabricator.wikimedia.org/T258826) [19:47:30] (03PS1) 10Andrew Bogott: Remove eth1 addresses for cloudcephmon hosts [dns] - 10https://gerrit.wikimedia.org/r/616152 
(https://phabricator.wikimedia.org/T258826) [19:48:46] (03PS1) 10Andrew Bogott: Move cloudcephmon hosts from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616153 (https://phabricator.wikimedia.org/T258826) [19:52:01] (03PS1) 10Dzahn: phabricator/aphlict: set base_dir parameter when using aphlict [puppet] - 10https://gerrit.wikimedia.org/r/616154 (https://phabricator.wikimedia.org/T238593) [19:52:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... [19:52:53] (03PS2) 10Dzahn: phabricator/aphlict: set base_dir parameter when using aphlict [puppet] - 10https://gerrit.wikimedia.org/r/616154 (https://phabricator.wikimedia.org/T238593) [19:54:12] (03PS2) 10Andrew Bogott: Add ips in cloud-hosts1-b-eqiad for cloudcephmon nodes [dns] - 10https://gerrit.wikimedia.org/r/616150 (https://phabricator.wikimedia.org/T258826) [19:54:14] (03PS2) 10Andrew Bogott: Remove public IP addresses for cloudcephmons nodes [dns] - 10https://gerrit.wikimedia.org/r/616151 (https://phabricator.wikimedia.org/T258826) [19:54:16] (03PS2) 10Andrew Bogott: Remove eth1 addresses for cloudcephmon hosts [dns] - 10https://gerrit.wikimedia.org/r/616152 (https://phabricator.wikimedia.org/T258826) [19:54:57] (03CR) 10Andrew Bogott: [C: 03+2] Add ips in cloud-hosts1-b-eqiad for cloudcephmon nodes [dns] - 10https://gerrit.wikimedia.org/r/616150 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [19:55:16] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@772b4a3]: Deploy CLs 611465 and 613740 to add compression support to ArcLamp [19:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:21] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@772b4a3]: 
Deploy CLs 611465 and 613740 to add compression support to ArcLamp (duration: 00m 05s) [19:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:42] (03CR) 10Dzahn: [C: 03+2] phabricator/aphlict: set base_dir parameter when using aphlict [puppet] - 10https://gerrit.wikimedia.org/r/616154 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:57:52] !log Manually gzipping some older ArcLamp data on webperf1002, to free up space and verify new compression support. [19:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:34] (03PS2) 10Andrew Bogott: Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616153 (https://phabricator.wikimedia.org/T258826) [20:02:37] (03PS1) 10Andrew Bogott: Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616156 (https://phabricator.wikimedia.org/T258826) [20:02:39] (03PS1) 10Andrew Bogott: Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616157 (https://phabricator.wikimedia.org/T258826) [20:04:56] (03PS1) 10Ahmon Dancy: Update blubberoid to 2020-07-24-194337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616158 (https://phabricator.wikimedia.org/T254629) [20:06:34] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:54] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616153 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [20:09:37] (03PS1) 10Dzahn: access: let existing phabricator admins get on aphlict machine 
[puppet] - 10https://gerrit.wikimedia.org/r/616159 (https://phabricator.wikimedia.org/T238593) [20:11:06] (03PS1) 10Dzahn: phabricator: fix the basedir/base_dir parameter name for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/616160 (https://phabricator.wikimedia.org/T238593) [20:11:19] (03PS2) 10Dzahn: access: let existing phabricator admins get on aphlict machine [puppet] - 10https://gerrit.wikimedia.org/r/616159 (https://phabricator.wikimedia.org/T238593) [20:13:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1011.eqiad.wmnet'] `... [20:14:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne... 
[20:14:33] (PS3) Dzahn: access: let existing phabricator-root-admins get on aphlict machine [puppet] - https://gerrit.wikimedia.org/r/616159 (https://phabricator.wikimedia.org/T238593)
[20:20:04] (CR) Dzahn: [C: +2] access: let existing phabricator-root-admins get on aphlict machine [puppet] - https://gerrit.wikimedia.org/r/616159 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[20:20:53] (PS2) Dzahn: phabricator: fix the basedir/base_dir parameter name for aphlict [puppet] - https://gerrit.wikimedia.org/r/616160 (https://phabricator.wikimedia.org/T238593)
[20:22:05] (CR) Dzahn: [C: +2] phabricator: fix the basedir/base_dir parameter name for aphlict [puppet] - https://gerrit.wikimedia.org/r/616160 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[20:28:15] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1012.eqiad.wmnet', 'c...
[20:30:17] (PS1) Dzahn: admins: remove demon from gerrit and phab root users [puppet] - https://gerrit.wikimedia.org/r/616164
[20:35:29] Operations, ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (Jclark-ctr) @CDanis connected console to port 41 on scs-c1-eqiad
[20:37:29] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:41:00] Operations, ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (Jclark-ctr) Open→Resolved
[20:41:02] Operations, netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (Jclark-ctr)
[20:41:55] Operations, netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (Jclark-ctr) Connected console port to scs-c1-eqiad updated netbox with connection
[20:43:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:44:27] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) (owner: Herron)
[20:46:28] (PS1) Dzahn: aphlict: require php-cli package [puppet] - https://gerrit.wikimedia.org/r/616165 (https://phabricator.wikimedia.org/T238593)
[20:46:52] (CR) Cwhite: [C: +1] "The approach is certainly an improvement. LGTM!" [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron)
[20:49:57] (CR) Dzahn: [C: +2] aphlict: require php-cli package [puppet] - https://gerrit.wikimedia.org/r/616165 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[20:50:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:50:52] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (wiki_willy) a:Cmjohnson
[20:53:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:54:45] (PS2) Andrew Bogott: Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/616156 (https://phabricator.wikimedia.org/T258826)
[20:54:47] (PS2) Andrew Bogott: Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/616157 (https://phabricator.wikimedia.org/T258826)
[20:54:49] (PS1) Andrew Bogott: Add site.pp entries for cloudcephmon1001-1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/616168 (https://phabricator.wikimedia.org/T258826)
[20:55:54] (CR) Andrew Bogott: [C: +2] Add site.pp entries for cloudcephmon1001-1003.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/616168 (https://phabricator.wikimedia.org/T258826) (owner: Andrew Bogott)
[21:07:34] Operations, LDAP-Access-Requests: Add DVrandecic to superuser and turnilo wmf group - https://phabricator.wikimedia.org/T258837 (DVrandecic)
[21:13:19] (PS1) Andrew Bogott: Revert "Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/616038
[21:15:23] (CR) Andrew Bogott: [C: +2] Revert "Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/616038 (owner: Andrew Bogott)
[21:32:20] Operations, ops-codfw, RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (wiki_willy) a:Papaul
[21:42:26] (CR) DannyS712: [C: +1] labs: Increase wgAbuseFilterEmergencyDisableThreshold for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) (owner: Urbanecm)
[22:04:36] (PS1) Dzahn: ssl: add certificate for aphlict.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/616171 (https://phabricator.wikimedia.org/T238593)
[22:07:15] (CR) Dzahn: [C: +2] ssl: add certificate for aphlict.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/616171 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[22:08:20] (PS2) Dzahn: ssl: add certificate for aphlict.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/616171 (https://phabricator.wikimedia.org/T238593)
[22:10:07] (PS1) Andrew Bogott: Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/616172 (https://phabricator.wikimedia.org/T258826)
[22:10:41] (CR) Dzahn: [C: +2] ssl: add certificate for aphlict.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/616171 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[22:12:11] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1127.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:13:40] (PS1) Dzahn: add fake key for aphlict.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/616173 (https://phabricator.wikimedia.org/T238593)
[22:14:22] (CR) Dzahn: [V: +2 C: +2] add fake key for aphlict.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/616173 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[22:14:39] (PS2) Dzahn: add fake key for aphlict.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/616173 (https://phabricator.wikimedia.org/T238593)
[22:30:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:33:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:42:37] (PS1) Dave Pifke: [WIP] arclamp: run arclamp-compress-logs [puppet] - https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T256035)
[22:47:40] (CR) Dzahn: [V: +2 C: +2] add fake key for aphlict.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/616173 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[22:48:34] (PS2) Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035)
[23:00:04] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[23:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:43] ^ This failure is for a data reload on a test instance, so there's no actual production problem here just to be explicit. We'll troubleshoot/kick off the reload again on Monday
[23:31:31] (PS1) Jared Blumer: eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183
[23:35:17] (PS2) Jared Blumer: eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495)
[23:39:41] (CR) A2093064: [C: +1] labs: Increase wgAbuseFilterEmergencyDisableThreshold for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) (owner: Urbanecm)
[23:53:47] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:57:37] (PS1) Dzahn: aphlict: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/616184 (https://phabricator.wikimedia.org/T238593)
[23:59:53] (PS2) Dzahn: aphlict: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/616184 (https://phabricator.wikimedia.org/T238593)