[00:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T0000).
[00:00:05] <jouncebot>	 legoktm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:19] <wikibugs>	 (03PS1) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521)
[00:00:23] <legoktm>	 guess it's just me
[00:00:50] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664649 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:00:53] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664650 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:01:01] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Set $wgTimelineFontDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:02:21] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
[00:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:45] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgTimelineFontDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:04:26] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
[00:04:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:52] <wikibugs>	 (03Merged) 10jenkins-bot: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664649 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:07:31] <wikibugs>	 (03Merged) 10jenkins-bot: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664650 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm)
[00:09:41] <legoktm>	 https://test.wikipedia.org/w/index.php?title=EasyTimeline&type=revision&diff=466643&oldid=466221
[00:09:45] <legoktm>	 it works :)
[00:11:46] <legoktm>	 and on wmf.30 too, just tested in preview though
[00:13:32] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/timeline.php: Set $wgTimelineFontDirectory (T274822) (duration: 01m 05s)
[00:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:40] <stashbot>	 T274822: [EasyTimeline] No text included / no font rendered / displayed at all in PNG graph output - https://phabricator.wikimedia.org/T274822
[00:15:47] <logmsgbot>	 !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 02s)
[00:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:36] <icinga-wm>	 RECOVERY - Long running screen/tmux on centrallog1001 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:17:19] <logmsgbot>	 !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 06s)
[00:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:58] <icinga-wm>	 PROBLEM - Host cloudnet1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
[00:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:26] <icinga-wm>	 RECOVERY - Host cloudnet1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[00:31:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:56] <mutante>	 !log mw1351 - powercycled
[00:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:50] <wikibugs>	 (03CR) 10Legoktm: "I explained how I tested this at T273521#6835886." [puppet] - 10https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm)
[00:39:50] <icinga-wm>	 RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[00:49:13] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1351.eqiad.wmnet
[00:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:02] <icinga-wm>	 RECOVERY - Long running screen/tmux on mwdebug1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:56:28] <wikibugs>	 (03PS1) 10Legoktm: mediawiki: Remove hhvm reference in mw-cgroup unit [puppet] - 10https://gerrit.wikimedia.org/r/664688
[00:57:46] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1345.eqiad.wmnet'] `  an...
[00:58:35] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1345.eqiad.wmnet
[00:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:17] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
[01:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
[01:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:00] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet
[01:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:23] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1345.eqiad.wmnet
[01:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
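The conftool entries above flip hosts between pooled and depooled several times. As a rough sketch of how the final state could be reconstructed from such log lines, the snippet below replays them in order. The regular expression and the function name `replay_pool_state` are illustrative assumptions based only on the message format visible in this log, not on conftool itself.

```python
import re

# Pattern inferred from the "!log" lines in this channel (an assumption,
# not an official conftool log format).
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(yes|no); selector: name=(\S+)"
)

def replay_pool_state(lines):
    """Return {hostname: pooled?} after applying each log line in order."""
    state = {}
    for line in lines:
        match = CONFTOOL_RE.search(line)
        if match:
            pooled, host = match.groups()
            state[host] = (pooled == "yes")
    return state

# The seven conftool actions logged between 00:27 and 01:04.
log_lines = [
    "conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet",
    "conftool action : set/pooled=no; selector: name=mw1351.eqiad.wmnet",
    "conftool action : set/pooled=no; selector: name=mw1345.eqiad.wmnet",
    "conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet",
    "conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet",
    "conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet",
    "conftool action : set/pooled=yes; selector: name=mw1345.eqiad.wmnet",
]
final_state = replay_pool_state(log_lines)
```

Under these assumptions, mw1345 and mw1351 end up pooled at this point in the log, while mwdebug1001 is still depooled.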
[01:19:16] <wikibugs>	 (03PS1) 10Dzahn: mcrouter_wancache: move mcrouter proxy from mw1270 to mw1271 [puppet] - 10https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[01:19:50] <wikibugs>	 (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - 10https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[01:23:55] <wikibugs>	 (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - 10https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757)
[01:26:50] <mutante>	 yes
[01:27:13] <wikibugs>	 (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - 10https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757)
[01:37:29] <icinga-wm>	 PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[01:39:17] <mutante>	 !log mwdebug1001 - rebooting
[01:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:15] <icinga-wm>	 RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[01:41:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
[01:41:53] <mutante>	 !log mwdebug1001 - back on buster and pooled
[01:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:33] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn)
[02:56:21] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:43] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:57] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 5.509 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:29:01] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:15] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:54:13] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:12:07] <wikibugs>	 (03CR) 10Legoktm: arclamp: add excimer-real pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) (owner: 10Dave Pifke)
[04:38:39] <icinga-wm>	 PROBLEM - Long running screen/tmux on centrallog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 12365, 7305491s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
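The two durations in the alert above are easier to read in days. The sketch below assumes the first number (7305491s) is the age of the tmux process and the second (1728000s) is the alerting limit. The alert text does not label them, so this reading is an assumption, as is the strict "greater than" comparison.

```python
SECONDS_PER_DAY = 86_400

def seconds_to_days(seconds: int) -> float:
    """Convert a duration in seconds to (fractional) days."""
    return seconds / SECONDS_PER_DAY

def is_long_running(age_s: int, limit_s: int) -> bool:
    """Assumed rule: flag the process once its age exceeds the limit."""
    return age_s > limit_s

age_s = 7_305_491    # assumed: age of the tmux process from the alert
limit_s = 1_728_000  # assumed: alerting limit from the alert

age_days = seconds_to_days(age_s)      # roughly 84.6 days
limit_days = seconds_to_days(limit_s)  # exactly 20 days
flagged = is_long_running(age_s, limit_s)
```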
[04:43:47] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10Papaul) @Jclark-ctr no good news, you will have to try to use a DVD.  Thanks
[04:50:38] <wikibugs>	 10SRE, 10serviceops, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10Krinkle)
[04:52:51] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Sustainability (Incident Followup), 10Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (10Krinkle)
[04:59:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 385054056912 and 446944 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:11:48] <wikibugs>	 10SRE, 10serviceops, 10SRE-OnFire-Incident-Docs, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10Krinkle) >>! In T272215#6755992, @jcrespo wrote: > More details are yet to be provided on the Incident report, I can help with that once the right...
[05:14:44] <wikibugs>	 10SRE, 10serviceops, 10SRE-OnFire-Incident-Docs, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10Krinkle)
[06:10:02] <marostegui>	 In around 50 minutes I will be restarting x1 master (daemon restart)
[06:18:43] <wikibugs>	 (03PS1) 10Marostegui: db1172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361)
[06:20:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[06:24:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think this patch does what we want, but I would love to generalize what you did a bit (in the nginx configuration). That should be done " [puppet] - 10https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm)
[06:32:15] <wikibugs>	 (03CR) 10Marostegui: "This was enabling notifications" [puppet] - 10https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[06:35:29] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db1172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664723 (https://phabricator.wikimedia.org/T258361)
[06:36:31] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664723 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[06:39:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1172 to dbctl, but not pooled yet T258361', diff saved to https://phabricator.wikimedia.org/P14385 and previous config saved to /var/cache/conftool/dbconfig/20210217-063915-marostegui.json
[06:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:22] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:52:05] <marostegui>	 In around 10 minutes I will be restarting x1 master (daemon restart)
[07:00:04] <marostegui>	 !log Restart db1103 (x1) primary master - T273758
[07:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:09] <stashbot>	 T273758: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758
[07:01:26] <marostegui>	 !log Restart db1103 (x1) primary master DONE - T273758
[07:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:04:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
[07:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 19 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
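The PROBLEM and RECOVERY messages above report the value against a critical limit of 100 and a warning limit of 50 ("1105 gt 100", then "(C)100 gt (W)50 gt 19"). A minimal sketch of that comparison follows. The function name `classify` and the strict greater-than semantics are assumptions drawn from the "gt" wording, not from the actual check definition.

```python
def classify(value: float, warn: float, crit: float) -> str:
    """Map a metric value to an alert state using two thresholds.

    Assumes strict 'greater than' comparisons, as the 'gt' in the
    alert text suggests.
    """
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"

during_restart = classify(1105, warn=50, crit=100)  # value from the 07:03:48 alert
after_restart = classify(19, warn=50, crit=100)     # value from the 07:05:06 recovery
```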
[07:06:06] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui) This was done: Master down: 07:00:09 Master up: 07:01:24
[07:06:44] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui)
[07:07:00] <elukey>	 marostegui: I was about to ask, anything happening to db1103.eqiad.wmnet ? :D
[07:07:11] <marostegui>	 yep, the restart :)
[07:07:21] <elukey>	 yes yes all good, thanks :)
[07:07:29] <marostegui>	 thanks :**
[07:13:14] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui) Closing this as fixed: ` # mysql -e "select @@report_host" +--------------------+ | @@report_host      | +--------------------+ | db1103.eqiad.wmnet |...
[07:13:23] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui) 05Open→03Resolved
[07:16:27] <marostegui>	 !log Add x1 to orchestrator
[07:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
[07:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1172 in s8 for the first time - T258361', diff saved to https://phabricator.wikimedia.org/P14386 and previous config saved to /var/cache/conftool/dbconfig/20210217-072131-marostegui.json
[07:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:37] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:22:22] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet
[07:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:14] <wikibugs>	 (03PS10) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[07:30:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet
[07:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet
[07:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet
[07:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14387 and previous config saved to /var/cache/conftool/dbconfig/20210217-074107-marostegui.json
[07:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:12] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:43:15] <wikibugs>	 (03PS1) 10Ladsgroup: wikilabels: Remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/664752 (https://phabricator.wikimedia.org/T273673)
[07:46:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The code looks ok, but I don't see an additional checking that the new metrics act like expected. You should add at least one test in medi" [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) (owner: 10Hnowlan)
[07:48:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Before merging I'd check that the service can be restarted with no harm caused to live requests." [puppet] - 10https://gerrit.wikimedia.org/r/664688 (owner: 10Legoktm)
[07:59:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: set vega.enabled: false by default [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: 10Herron)
[07:59:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: update home dashboard to Grafana 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/664555 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi)
[08:00:53] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo)
[08:02:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Georgina Burnett to wmde LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/664772 (https://phabricator.wikimedia.org/T273780)
[08:04:01] <wikibugs>	 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10fgiunchedi) >>! In T274488#6835686, @Jclark-ctr wrote: > @fgiunchedi  would you be ok with chassis swap using ms-be1018 recently decommissioned?  Yes, please proceed
[08:04:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Georgina Burnett to wmde LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/664772 (https://phabricator.wikimedia.org/T273780) (owner: 10Muehlenhoff)
[08:04:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
[08:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:13] <wikibugs>	 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10fgiunchedi) 05Open→03Resolved LGTM, thank you @Papaul
[08:06:22] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10MoritzMuehlenhoff) 05Open→03Resolved @georginaburnett-wmde : Your access has been enabled. I'm closing the task, please reopen if you run into any issues!
[08:07:16] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
[08:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10MoritzMuehlenhoff) Ack, this needs approval/discussion in the next SRE meeting since it would create a new access group.
[08:11:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:12:06] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[08:13:52] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[08:16:12] <wikibugs>	 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10MoritzMuehlenhoff) @MattCleinman : Hi, in this case for access to Superset we don't need your SSH key (but in fact you need to be added to analytics-privatedata-users).  This needs approval from the f...
[08:16:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:23:16] <wikibugs>	 (03CR) 10Jcrespo: "Andrew, this is galera, which only cloud use, so I don't have a say on it." [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[08:27:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] Openstack control node galera: send mariadb logs to central logging [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[08:27:48] <wikibugs_>	 (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo)
[08:37:38] <icinga-wm>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 117 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:37:52] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:40:03] <wikibugs>	 10SRE, 10serviceops, 10SRE-OnFire-Incident-Docs, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10jcrespo) I personally don't feel capable neither to write proper docs, file follow ups nor to close it. When I said "more details are yet to be pr...
[08:40:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10MoritzMuehlenhoff) @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you to analytics-privatedata-users:  * Your m...
[08:41:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14388 and previous config saved to /var/cache/conftool/dbconfig/20210217-084120-marostegui.json
[08:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:27] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:44:09] <marostegui>	 !log upgrade es2020 es2021 es2022's kernel
[08:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:39] <wikibugs>	 10SRE, 10serviceops, 10SRE-OnFire-Incident-Docs, 10Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (10Joe) >>! In T272215#6836259, @Krinkle wrote: >>>! In T272215#6755992, @jcrespo wrote: >> More details are yet to be provided on the Incident repor...
[08:49:08] <icinga-wm>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 8 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:49:18] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:14] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[08:54:18] <wikibugs>	 (03Abandoned) 10Kosta Harlan: linkrecommendation: Disable cron job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[08:56:27] <wikibugs>	 10SRE, 10GitLab, 10SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10Aklapper)
[08:59:19] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m1-master to dbproxy1012 [dns] - 10https://gerrit.wikimedia.org/r/664774
[09:05:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - 10https://gerrit.wikimedia.org/r/662918 (owner: 10Muehlenhoff)
[09:10:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mtail: add exception handling in tests for non-Debian OSes [puppet] - 10https://gerrit.wikimedia.org/r/663860 (owner: 10Hnowlan)
[09:23:53] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315)
[09:24:20] <icinga-wm>	 PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:11] <wikibugs>	 10ops-eqiad, 10Analytics: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (10elukey)
[09:27:50] <icinga-wm>	 RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:47] <elukey>	 !log reboot dbstore100[3-5] for kernel upgrades
[09:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:49] <elukey>	 (stop replicas; stop mariadb instances; umount /srv; reboot; etc..)
[09:34:26] <kormat>	 elukey: do `swapoff -a` too, before reboot
[09:34:36] <kormat>	 (if it's not already too late)
[09:35:46] <elukey>	 kormat: too late for 1003 but I'll do it for the others thanks :)
[09:35:59] <kormat>	 reclaiming swap during reboot is sometimes paaainfully slow. i've had db machines hang for 20-30 minutes when all i wanted was a nice quick reboot
[09:36:19] <elukey>	 +1 right
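The reboot sequence discussed above, with kormat's `swapoff -a` suggestion folded in, can be written down as an ordered checklist. This is only a sketch that builds the list of steps and executes nothing. The function name `pre_reboot_steps` is hypothetical, "stop replicas" and "stop mariadb instances" are descriptions rather than real commands, and placing `swapoff -a` right before the reboot is an assumption (the chat only says "before reboot").

```python
def pre_reboot_steps(disable_swap: bool = True) -> list[str]:
    """Return the ordered checklist for rebooting a database host.

    Nothing is executed here; the entries are the steps named in the chat.
    """
    steps = [
        "stop replicas",           # description only, exact command not given
        "stop mariadb instances",  # description only, exact command not given
        "umount /srv",
    ]
    if disable_swap:
        # Suggested so the reboot does not stall while swap is reclaimed.
        steps.append("swapoff -a")
    steps.append("reboot")
    return steps

checklist = pre_reboot_steps()
```

Keeping the swap step behind a flag mirrors the conversation: dbstore1003 was rebooted without it, and the remaining hosts get it.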
[09:36:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan)
[09:37:59] <wikibugs>	 (03PS1) 10Kormat: WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664779
[09:40:02] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] wmnet: Failover m1-master to dbproxy1012 [dns] - 10https://gerrit.wikimedia.org/r/664774 (owner: 10Marostegui)
[09:41:11] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664779 (owner: 10Kormat)
[09:42:04] <wikibugs>	 (03PS11) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[09:42:31] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[09:42:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Initial stub role for rootless Cumin [puppet] - 10https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840)
[09:43:32] <wikibugs>	 (03Merged) 10jenkins-bot: WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664779 (owner: 10Kormat)
[09:43:36] <wikibugs>	 (03PS1) 10Jbond: cloud idp: update mapped attributes [puppet] - 10https://gerrit.wikimedia.org/r/664781
[09:43:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master to dbproxy1012 [dns] - 10https://gerrit.wikimedia.org/r/664774 (owner: 10Marostegui)
[09:44:03] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "If puppet compiler is happy go for it, lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[09:44:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud idp: update mapped attributes [puppet] - 10https://gerrit.wikimedia.org/r/664781 (owner: 10Jbond)
[09:45:54] <wikibugs>	 (03PS2) 10Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315)
[09:53:05] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Unless I'm missing something I think there is a flaw in the current approach." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[09:55:18] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:55:43] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/28103/" [puppet] - 10https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[09:55:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Initial stub role for rootless Cumin [puppet] - 10https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[09:56:34] <wikibugs>	 (03PS1) 10DCausse: [wdqs] disable fetching constraints [puppet] - 10https://gerrit.wikimedia.org/r/664782 (https://phabricator.wikimedia.org/T274982)
[09:59:02] <wikibugs>	 (03PS3) 10Jbond: P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782)
[09:59:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debug_host: calculate the correct realm [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/664598 (owner: 10Jbond)
[10:01:15] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[10:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:38] <icinga-wm>	 PROBLEM - MariaDB Replica IO: x1 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:01:58] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:02:02] <icinga-wm>	 PROBLEM - MariaDB read only x1 on dbstore1005 is CRITICAL: Could not connect to localhost:3320 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:14] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:26] <icinga-wm>	 PROBLEM - MariaDB Replica IO: staging on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:32] <icinga-wm>	 PROBLEM - MariaDB read only s8 on dbstore1005 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:36] <icinga-wm>	 PROBLEM - MariaDB read only staging on dbstore1005 is CRITICAL: Could not connect to localhost:3350 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:40] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:42] <icinga-wm>	 PROBLEM - MariaDB read only s6 on dbstore1005 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:46] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:50] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:52] <marostegui>	 elukey: ^
[10:02:56] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:57] <marostegui>	 I guess that's you rebooting?
[10:02:58] <icinga-wm>	 PROBLEM - mysqld processes on dbstore1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:03:08] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: x1 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:03:18] <elukey>	 marostegui: yep downtime expired probably
[10:03:47] <elukey>	 sorry for the noise
[10:03:49] <marostegui>	 elukey: coool, nothing to worry about then :)
[10:04:04] <icinga-wm>	 RECOVERY - MariaDB Replica IO: staging on dbstore1005 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:04:05] <kormat>	 elu downtimes-expire-so-quickly-why-even-create-them key
[10:04:18] <icinga-wm>	 RECOVERY - MariaDB read only staging on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 13s, read_only: False, event_scheduler: True, 15.61 QPS, connection latency: 0.002788s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:04:38] <icinga-wm>	 RECOVERY - mysqld processes on dbstore1005 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:05:01] <elukey>	 kormat: I put half an hour for the three of them and of course I got distracted right after the last reboot :D
[10:05:10] <kormat>	 hehe
[10:05:24] <icinga-wm>	 RECOVERY - MariaDB read only x1 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 98s, read_only: True, event_scheduler: True, 39.47 QPS, connection latency: 0.001820s, query latency: 0.000356s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:05:32] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: staging on dbstore1005 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:05:43] <elukey>	 marostegui: so reboots completed for dbstores :P
[10:05:50] <marostegui>	 \o/
[10:05:50] <icinga-wm>	 RECOVERY - MariaDB read only s8 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 92s, read_only: True, event_scheduler: True, 6680.22 QPS, connection latency: 0.005290s, query latency: 0.000674s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:05:52] <marostegui>	 thanks
[10:05:53] <kormat>	 elukey: there's a terrible script on cumin1001, `/home/kormat/bin/reboot-host`
[10:05:58] <kormat>	 it'll eventually become a cookbook
[10:06:00] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s8 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:04] <kormat>	 but might be worth looking at if you're doing much of this
[10:06:04] <icinga-wm>	 RECOVERY - MariaDB read only s6 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 87s, read_only: True, event_scheduler: True, 4015.24 QPS, connection latency: 0.003376s, query latency: 0.000280s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:06:06] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s8 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:10] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:14] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544
[10:06:16] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s6 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:18] <elukey>	 kormat: I wanted to ask about it, definitely interested in working on a cookbook 
[10:06:28] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: x1 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:36] <icinga-wm>	 RECOVERY - MariaDB Replica IO: x1 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:40] <elukey>	 kormat: is it an attempt to nerd-snipe me I suppose :D
[10:06:47] <elukey>	 *it is
[10:06:58] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:07:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add the add_user filter (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto)
[10:07:11] <wikibugs>	 (03PS12) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[10:07:14] <elukey>	 I can fall into the trap if you review some reuse-partman recipe that I am going to send in a bit :D
[10:07:39] <kormat>	 elukey: i can neither confirm nor deny
[10:08:22] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:01] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) 05Open→03Resolved a:03Joe Given the original issue is definitely resolved, there is no point in keeping this ta...
[10:12:16] <wikibugs>	 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (10Joe) a:03Joe
[10:12:54] <_joe_>	 jouncebot: next
[10:12:55] <jouncebot>	 In 1 hour(s) and 47 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1200)
[10:13:22] <_joe_>	 !log depooling mw1331 to perform some tests for T266855
[10:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:29] <stashbot>	 T266855: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855
[10:19:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[10:19:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782) (owner: 10Jbond)
[10:21:51] <jbond42>	 kormat: re the reboot-host script you have, perhaps you could review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/657102 to ensure it's flexible enough to account for the use cases (cough volans)
[10:23:23] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: set up conntrack sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/664785 (https://phabricator.wikimedia.org/T272963)
[10:23:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (10MoritzMuehlenhoff) p:05Triage→03Medium
[10:23:47] <wikibugs>	 10SRE, 10Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10MoritzMuehlenhoff) p:05Triage→03High
[10:24:20] <wikibugs>	 10SRE: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10MoritzMuehlenhoff) p:05Triage→03Medium
[10:25:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: set up conntrack sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/664785 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[10:26:19] <kormat>	 jbond42: the pre/post scripts: do they run on the cumin hosts, or on the target hosts?
[10:26:32] <kormat>	 (i need both)
[10:27:21] <kormat>	 i guess i could shell out to cumin from the cumin hosts; it doesn't feel very clean though
[10:27:45] <volans>	 kormat: what are you trying to do?
[10:27:53] * volans missing context
[10:27:56] <kormat>	 volans: amazing things
[10:27:59] <jbond42>	 kormat: pre_scripts is a list of scripts that run on the host. pre_action is a function that by default calls self._run_scripts(self.pre_scripts, hosts), so you could hook that to do something on cumin first
[10:28:00] <volans>	 ofc :D
[10:28:06] <kormat>	 volans: /home/kormat/bin/reboot_host on cumin1001
[10:28:35] <volans>	 it's reboot-host :D
[10:28:47] <kormat>	 congrats, you got past the first guardian!
[10:29:03] <volans>	 to roommate agreement?™
[10:29:09] <kormat>	 jbond42: can you pass parameters to the scripts run on the target host?
[10:29:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:29:28] <volans>	 kormat: why that's not a cookbook?
[10:29:49] <kormat>	 volans: it will be, eventually. currently too many pre-reqs aren't in place.
[10:30:00] <volans>	 happy to chat/know what they are
[10:30:21] <_joe_>	 high latency in codfw??
[10:30:36] <jbond42>	 kormat: short answer is yes; depending on what exactly you need to do, you would need to hook pre_scripts or pre_actions
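The hook arrangement jbond42 describes (a `pre_scripts` list run on the target hosts, and a `pre_action` method that by default calls `self._run_scripts(self.pre_scripts, hosts)`) might look roughly like the sketch below. Only those three names come from the discussion. The class names, the script path, and the logging stand-in for remote execution are hypothetical, and this is not the code under review in the linked change.

```python
from typing import List


class RebootRunner:
    """Hypothetical stand-in for the cookbook base class under discussion."""

    # Scripts that run on the target hosts before the reboot.
    pre_scripts: List[str] = []

    def __init__(self) -> None:
        # Records what would be run, instead of executing anything.
        self.log: List[str] = []

    def _run_scripts(self, scripts: List[str], hosts: List[str]) -> None:
        # Stand-in for remote execution on each target host.
        for host in hosts:
            for script in scripts:
                self.log.append(f"{host}: {script}")

    def pre_action(self, hosts: List[str]) -> None:
        # Default behaviour: just run pre_scripts on the hosts.
        self._run_scripts(self.pre_scripts, hosts)


class DbRebootRunner(RebootRunner):
    """Override pre_action to do work on the cumin host first."""

    pre_scripts = ["/usr/local/bin/stop-replication"]  # hypothetical path

    def pre_action(self, hosts: List[str]) -> None:
        # Step that runs locally on the cumin host (hypothetical).
        self.log.append(f"cumin: local pre-check for {','.join(hosts)}")
        # Then fall back to the default: run pre_scripts on the targets.
        super().pre_action(hosts)


if __name__ == "__main__":
    runner = DbRebootRunner()
    runner.pre_action(["db1001"])
    print("\n".join(runner.log))
```

Overriding `pre_action` and then calling the parent keeps both halves of kormat's requirement: a step on the cumin host and scripts on the target hosts, without shelling out to cumin from inside a script.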
[10:30:43] <_joe_>	 what happened in codfw at 9:50?
[10:31:24] <_joe_>	 oh I see
[10:31:32] <_joe_>	 a ton of requests to mwdebug2002
[10:32:33] <wikibugs>	 (03CR) 10ArielGlenn: "PCC looks as I expect: https://puppet-compiler.wmflabs.org/compiler1002/28104/" [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[10:36:50] <wikibugs>	 (03PS1) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788
[10:37:38] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "My first pass, sorry if I commented on something already discussed in the previous PSes." (0320 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[10:44:52] <wikibugs>	 (03PS2) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788
[10:48:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 390 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:49:41] <_joe_>	 that is me
[10:50:37] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: keepalived: use nopreempt option [puppet] - 10https://gerrit.wikimedia.org/r/664789 (https://phabricator.wikimedia.org/T272963)
[10:51:40] <wikibugs>	 (03PS3) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788
[10:52:00] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure
[10:52:01] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure
[10:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) This hasn't arrived yet, right? It would be useful to have one large capacity system for [[ https://phabricator.wikimedia.org/T267338 | next week's test ]], but this is unrelated...
[10:52:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks John for the details, LGTM, there are just the two nit comments left, none is a blocker." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[10:53:02] <wikibugs>	 (03PS4) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788
[10:54:04] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) Adding local dc ops on CC of this ticket- things would have to go really bad to needing him for this test (this should be a relatively boring proc...
[10:54:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: keepalived: use nopreempt option [puppet] - 10https://gerrit.wikimedia.org/r/664789 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[10:55:34] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:56:20] <wikibugs>	 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (10Joe) First observation I can make is that most requests are done by the math extension, and usually go in pairs...
[10:56:49] <wikibugs>	 (03CR) 10Elukey: "Kormat: if you have time, here some examples of hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey)
[11:02:30] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[11:03:54] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:24] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:34] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:35] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:46] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:08:47] <icinga-wm>	 RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:27] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.32.8:9042 on maps1009 is OK: TCP OK - 0.013 second response time on 10.64.32.8 port 9042 https://phabricator.wikimedia.org/T93886
[11:11:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto)
[11:17:05] <wikibugs>	 (03CR) 10Jbond: install_server/dhcp: dhcpd.conf include mechanism support machinery (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[11:17:15] <wikibugs>	 (03CR) 10Kormat: [C: 04-2] "I have comments." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey)
[11:17:33] <kormat>	 ^ for when -1 isn't strong enough
[11:19:52] <elukey>	 kormat: I was counting on it yes
[11:19:58] <kormat>	 :D
[11:21:08] <elukey>	 should have used lsblk, +1 
[11:22:43] <elukey>	 thanks a lot for the feedback, going to work on it and fill my ignorance gaps
[11:22:51] <kormat>	 yw <3
[11:24:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14389 and previous config saved to /var/cache/conftool/dbconfig/20210217-112422-marostegui.json
[11:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:28] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[11:32:11] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:34:39] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:34:57] <wikibugs>	 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:35:50] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: P:services_proxy::envoy: add keepalive to restbase-https [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855)
[11:38:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[11:42:39] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1003 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:45:34] <wikibugs>	 (03PS1) 10Phuedx: Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798)
[11:45:36] <wikibugs>	 (03PS1) 10Phuedx: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798)
[11:49:51] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:50:03] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855)
[11:50:12] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855)
[11:55:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netbox-dev2001.wikimedia.org
[11:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:48] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2001.wikimedia.org
[11:58:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1200).
[12:00:04] <jouncebot>	 phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:29] <phuedx>	 o/
[12:01:16] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: start conntrackd before keepalived [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963)
[12:01:49] <Urbanecm>	 i can deploy today
[12:01:56] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] vector: Enable search treatment AB test on test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:02:43] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:02:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:03:04] <wikibugs>	 (03PS2) 10Phuedx: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798)
[12:03:22] <Urbanecm>	 thanks phuedx, sounds better :)
[12:04:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:04:14] <Majavah>	 Urbanecm: fiwiki VPT has reports of T273317 resurfacing :/
[12:04:14] <stashbot>	 T273317: some users with access are unable to configure pending changes - https://phabricator.wikimedia.org/T273317
[12:04:25] <Urbanecm>	 Majavah: oh....
[12:04:42] <Urbanecm>	 will look after b&c
[12:04:56] <Urbanecm>	 phuedx: pulled onto mwdebug1001, can you check?
[12:04:59] <Majavah>	 and I can't stabilize on test2wiki either, https://test2.wikipedia.org/w/index.php?title=Special:Stabilization&page=16th_december
[12:05:04] <Majavah>	 should I open that task or make a new one?
[12:05:32] <Urbanecm>	 Majavah: i can stabilize there..
[12:05:37] <Urbanecm>	 ...but maybe that's because I'm a S?
[12:05:40] <phuedx>	 Urbanecm: Tested. LGTM
[12:05:44] <Urbanecm>	 thanks, syncing
[12:06:40] <wikibugs>	 (03CR) 10Phuedx: vector: Enable search treatment AB test on test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:06:58] <Majavah>	 Urbanecm: my test2wiki +sysop expired, that might explain it
[12:07:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:07:14] <Urbanecm>	 Majavah: ah
[12:07:40] <Urbanecm>	 renewed
[12:07:41] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7872251778b65cb03eb5457f1b901d208d514609: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 25s)
[12:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:47] <stashbot>	 T259798: Deploy the new Vue.js search experience to the Beta-Cluster and Test Wikipedia - https://phabricator.wikimedia.org/T259798
[12:07:53] <wikibugs>	 (03Merged) 10jenkins-bot: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx)
[12:08:06] <Majavah>	 now I can stabilize
[12:08:21] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:08:34] <Urbanecm>	 phuedx: your second patch is at mwdebug1001
[12:08:42] <phuedx>	 Urbanecm: Thanks. Testing now
[12:09:14] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:10:35] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/desktop-improvements.dblist: 7872251778b65cb03eb5457f1b901d208d514609: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 09s)
[12:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:48] <Majavah>	 Urbanecm: do you see the form on fiwiki? https://fi.wikipedia.org/w/index.php?title=Toiminnot:Vakauta_sivu&page=Coari for example
[12:12:02] <Urbanecm>	 Majavah: I don't
[12:12:12] <Urbanecm>	 but I'm also not a fiwiki editor
[12:12:16] <phuedx>	 Urbanecm: LGTM. Thanks
[12:12:18] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refresh conntrackd service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963)
[12:12:22] <Urbanecm>	 thanks, syncing
[12:14:03] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6eeee95e090408c8bd35d14c2f76e3afd8a59048: vector: Enable search treatment AB test on test wikis (T259798) (duration: 01m 08s)
[12:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:08] <stashbot>	 T259798: Deploy the new Vue.js search experience to the Beta-Cluster and Test Wikipedia - https://phabricator.wikimedia.org/T259798
[12:14:11] <Urbanecm>	 and should be live :)
[12:14:14] <Urbanecm>	 anything else phuedx ?
[12:14:27] <phuedx>	 That's it for me. Thanks for deploying those changes Urbanecm
[12:14:34] <Urbanecm>	 no problem :)
[12:18:47] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh conntrackd service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[12:20:11] <wikibugs>	 (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[12:21:21] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805
[12:22:10] <wikibugs>	 (03PS2) 10Awight: New 2FA device [puppet] - 10https://gerrit.wikimedia.org/r/662661
[12:26:05] <wikibugs>	 (03PS3) 10Awight: New 2FA key for awight [puppet] - 10https://gerrit.wikimedia.org/r/662661
[12:26:29] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: ignore errors on ip token set [puppet] - 10https://gerrit.wikimedia.org/r/664806
[12:26:46] <wikibugs>	 (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[12:27:05] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "small typo, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez)
[12:27:56] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mtail: add exception handling in tests for non-Debian OSes [puppet] - 10https://gerrit.wikimedia.org/r/663860 (owner: 10Hnowlan)
[12:29:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: ignore errors on ip token set [puppet] - 10https://gerrit.wikimedia.org/r/664806 (owner: 10Arturo Borrero Gonzalez)
[12:38:01] <wikibugs>	 (03PS1) 10Volans: pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809
[12:38:03] <wikibugs>	 (03PS1) 10Volans: fileio: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810
[12:38:06] <wikibugs>	 (03PS1) 10Volans: fileio: manage blocks of text in files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664811
[12:39:32] <wikibugs>	 (03CR) 10Volans: "I did make the code but because we might not be using it right now not sure if it's worth to add or not. I didn't write the test that woul" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664811 (owner: 10Volans)
[12:40:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812
[12:40:08] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840)
[12:40:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 20%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14390 and previous config saved to /var/cache/conftool/dbconfig/20210217-124015-root.json
[12:40:17] <wikibugs>	 (03PS1) 10Jbond: make ca_source optional [puppet] - 10https://gerrit.wikimedia.org/r/664814
[12:40:19] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/664815
[12:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:21] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:49] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[12:41:02] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1172 is now being automatically pooled into s8
[12:41:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM I can't recall by memory if it needs any additional tweak in other files" [puppet] - 10https://gerrit.wikimedia.org/r/664812 (owner: 10Muehlenhoff)
[12:42:31] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[12:42:38] <wikibugs>	 (03CR) 10Noa wmde: [C: 03+1] Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE))
[12:42:41] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:42:52] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:23] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[12:43:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812 (owner: 10Muehlenhoff)
[12:45:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:13] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:04] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1146 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:48:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] make ca_source optional [puppet] - 10https://gerrit.wikimedia.org/r/664814 (owner: 10Jbond)
[12:49:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/664815 (owner: 10Jbond)
[12:49:57] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:08] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:50:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840)
[12:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 40%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14391 and previous config saved to /var/cache/conftool/dbconfig/20210217-125519-root.json
[12:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:46] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:56] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:00] <wikibugs>	 (03PS2) 10JMeybohm: admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254)
[13:05:59] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:02] <wikibugs>	 (03PS3) 10Muehlenhoff: Add unprivileged Cumin master(s) to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840)
[13:10:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14392 and previous config saved to /var/cache/conftool/dbconfig/20210217-131022-root.json
[13:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:03] <Lucas_WMDE>	 I’ll deploy a quick config change since the deployment calendar looks nicely free at the moment
[13:16:07] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032)
[13:16:22] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE))
[13:17:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE))
[13:19:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:664593|Enable Wikibase Repo ID generator rate limiting on Wikidata (T272032)]] (duration: 01m 11s)
[13:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:52] <stashbot>	 T272032: Add rate limit for creating Item IDs - https://phabricator.wikimedia.org/T272032
[13:21:43] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 04-1] "This was done as part of Iba56724e62720dc2e3bdfd0837e1ced4cb337586" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) (owner: 10DannyS712)
[13:25:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 60%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14393 and previous config saved to /var/cache/conftool/dbconfig/20210217-132526-root.json
[13:25:27] <wikibugs>	 (03PS1) 10JMeybohm: initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254)
[13:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:54] <wikibugs>	 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:26:26] <wikibugs>	 10SRE, 10Traffic: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:28:32] <moritzm>	 !log installing libzstd security updates on Buster
[13:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:58] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145)
[13:30:21] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:32] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:50] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Please check if this looks right, these configs are wonky." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński)
[13:31:32] <wikibugs>	 (03PS1) 10Kormat: integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821
[13:31:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libzstd [puppet] - 10https://gerrit.wikimedia.org/r/664822
[13:35:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libzstd [puppet] - 10https://gerrit.wikimedia.org/r/664822 (owner: 10Muehlenhoff)
[13:36:48] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821 (owner: 10Kormat)
[13:38:11] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/664824
[13:39:16] <wikibugs>	 (03Merged) 10jenkins-bot: integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821 (owner: 10Kormat)
[13:40:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 80%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14395 and previous config saved to /var/cache/conftool/dbconfig/20210217-134030-root.json
[13:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:29] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:37] <wikibugs>	 10SRE, 10Traffic: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 (10CDanis)
[13:41:55] <wikibugs>	 (03PS2) 10CDanis: Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028)
[13:42:14] <wikibugs>	 (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/28108/" [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis)
[13:42:54] <wikibugs>	 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) Current status: - `integration-env` script created to build docker image, download & cache bin...
[13:43:44] <wikibugs>	 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) What's not integration-tested yet: - db-compare - db-stop-in-sync - db-switchover
[13:44:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add unprivileged Cumin master(s) to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[13:46:57] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840)
[13:51:22] <wikibugs>	 (03Abandoned) 10DannyS712: Activate DiscussionTools as a beta feature on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) (owner: 10DannyS712)
[13:55:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14396 and previous config saved to /var/cache/conftool/dbconfig/20210217-135533-root.json
[13:55:34] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: default kubernetes policies: Add staging-codfw, remove logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826
[13:55:36] <wikibugs>	 (03PS1) 10Kormat: README.md: Update reqs for integration testing. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829
[13:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: calico: Remove default-deny GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827
[13:55:40] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828
[13:56:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 (10CDanis)
[14:06:26] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) >>! In T258413#6836429, @MoritzMuehlenhoff wrote: > @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you...
[14:07:17] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:24] <wikibugs>	 (03PS3) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840)
[14:07:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[14:08:53] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Why is that any better?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris)
[14:08:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[14:09:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[14:11:17] <wikibugs>	 (03Merged) 10jenkins-bot: initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[14:11:20] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis)
[14:12:25] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis)
[14:16:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Patch Set 1: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris)
[14:16:56] <wikibugs>	 (03PS4) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840)
[14:18:08] <wikibugs>	 (03PS1) 10CDanis: Revert "Increase nuke_limit in upload@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/664657 (https://phabricator.wikimedia.org/T275028)
[14:18:59] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Revert "Increase nuke_limit in upload@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/664657 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis)
[14:19:01] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[14:19:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[14:26:25] <cdanis>	 !log starting rolling restart of cp-upload@eqsin varnish-fe T275028
[14:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:30] <stashbot>	 T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028
[14:27:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830
[14:29:44] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] tegola: remove image in favour of blubber-built image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[14:30:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff)
[14:36:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto)
[14:40:50] <wikibugs>	 (03PS1) 10Kormat: integration: Allow skipping of checksumming of cached files. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831
[14:40:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[14:42:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28111/console" [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto)
[14:45:16] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] integration: Allow skipping of checksumming of cached files. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831 (owner: 10Kormat)
[14:46:36] <wikibugs>	 (03PS5) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788
[14:49:23] <wikibugs>	 (03Merged) 10jenkins-bot: integration: Allow skipping of checksumming of cached files. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831 (owner: 10Kormat)
[14:50:58] <wikibugs>	 (03CR) 10Elukey: "Thanks a lot for following up! I am going to get another -2 probably but hopefully this time the code is better :)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey)
[14:53:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] P:services_proxy::envoy: add keepalive to restbase-https [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto)
[14:58:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[14:58:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10Ottomata) Approved.
[15:00:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris)
[15:03:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Ottomata) Hi @CBogen, do you need direct access to data in Hadoop and Hive, or will you just be using Superset to access that data via Presto / Druid?  We've since...
[15:03:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Deployment access for Gabriele Modena - https://phabricator.wikimedia.org/T275020 (10WDoranWMF) As @gmodena's manager I approve this access.
[15:06:48] <wikibugs>	 (03CR) 10DannyS712: [C: 03+1] pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 (owner: 10Volans)
[15:08:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto)
[15:09:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris)
[15:10:43] <wikibugs>	 (03Merged) 10jenkins-bot: Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto)
[15:12:07] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] "This... actually looks good to me. :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey)
[15:12:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) >>! In T258413#6837821, @Ottomata wrote: > Hi @CBogen, do you need direct access to data in Hadoop and Hive, or will you just be using Superset to access t...
[15:13:04] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet
[15:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:00] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet
[15:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:43] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830 (owner: 10Filippo Giunchedi)
[15:20:04] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet
[15:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:27] <icinga-wm>	 RECOVERY - Check no envoy runtime configuration is left persistent on mw1331 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[15:24:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto)
[15:26:07] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805
[15:26:59] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet
[15:27:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:44] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
[15:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:21] <wikibugs>	 (03CR) 10MSantos: [C: 04-1] Add simple blubber image (031 comment) [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan)
[15:31:34] <moritzm>	 !log uploaded jasper 1.900.1-debian1-2.4+deb8u6+wmf3 to apt.wikimedia.org
[15:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:42] <wikibugs>	 (03CR) 10MSantos: [C: 04-1] "Another question I have, isn't this repository supposed to be an upstream mirror? Adding custom functionality, like the CI pipeline, could" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan)
[15:32:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris)
[15:32:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris)
[15:33:37] <wikibugs>	 (03Merged) 10jenkins-bot: default kubernetes policies: Add staging-codfw, remove logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris)
[15:34:13] <wikibugs>	 (03Merged) 10jenkins-bot: calico: Remove default-deny GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris)
[15:34:57] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
[15:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:10] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[15:35:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[15:35:28] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828
[15:36:54] <cdanis>	 !log T275028 rolling restart done; check for fetch failures once caches re-fill
[15:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:58] <stashbot>	 T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028
[15:39:26] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez)
[15:42:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Add nikkin and gmodena to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021)
[15:42:47] <wikibugs>	 (03CR) 10Jgiannelos: Add simple blubber image (031 comment) [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan)
[15:43:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add nikkin and gmodena to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021) (owner: 10Muehlenhoff)
[15:44:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) Hi @MoritzMuehlenhoff . Thanks for your help on this, I have managed to set up the config and ssh in. One question; I am not able to connect...
[15:44:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) 05Resolved→03Open
[15:45:05] <wikibugs>	 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani)
[15:45:27] <wikibugs>	 (03CR) 10ArielGlenn: "not this way, please add them to platform-engineering as they are members of that team, that's what I tried to say in the task." [puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021) (owner: 10Muehlenhoff)
[15:45:31] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Deployment access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T275021 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @nikkin: You have been added to the deployment group.
[15:45:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Deployment access for Gabriele Modena - https://phabricator.wikimedia.org/T275020 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @gmodena : You have been added to the deployment group.
[15:45:57] <wikibugs>	 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) >>! In T274459#6834660, @thcipriani wrote: > Hi @Dzahn apologies if there's a format for these kinds of requests that I missed: am I missing any info or tags for this request?  Ah ha! Fou...
[15:47:38] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Release 3.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/664844
[15:51:11] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron l3: activate net.netfilter.nf_conntrack_tcp_be_liberal [puppet] - 10https://gerrit.wikimedia.org/r/664845 (https://phabricator.wikimedia.org/T268335)
[15:51:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Move nikkin and gmodena to platform-engineering instead (which also grants deployment) [puppet] - 10https://gerrit.wikimedia.org/r/664846
[15:52:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move nikkin and gmodena to platform-engineering instead (which also grants deployment) [puppet] - 10https://gerrit.wikimedia.org/r/664846 (owner: 10Muehlenhoff)
[15:53:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[15:54:50] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828
[15:54:57] <wikibugs>	 (03PS2) 10Muehlenhoff: Move nikkin and gmodena to platform-engineering [puppet] - 10https://gerrit.wikimedia.org/r/664846
[15:55:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[15:55:55] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission analytics10[42-57] - https://phabricator.wikimedia.org/T267932 (10wiki_willy)
[15:55:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[15:56:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28113/" [puppet] - 10https://gerrit.wikimedia.org/r/664845 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez)
[15:58:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Move nikkin and gmodena to platform-engineering [puppet] - 10https://gerrit.wikimedia.org/r/664846 (owner: 10Muehlenhoff)
[15:58:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10elukey) Hi @ChristineDeKock, can you try to use `christinedk` as username and then the password of the wikitech credentials?
[16:02:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[16:04:12] <wikibugs>	 (03Merged) 10jenkins-bot: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris)
[16:04:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release 3.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/664844 (owner: 10Giuseppe Lavagetto)
[16:04:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) @elukey I have tried this and it fails.
[16:05:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks ok to me, hard to tell if it will work at first try given the amount of things moved around. Thanks a lot Luca for the effort!" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[16:05:50] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided)
[16:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez)
[16:06:21] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided) (duration: 00m 30s)
[16:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:31] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Ottomata) Hiya, Christine will need LDAP membership in the `nda` group for this access.  @ChristineDeKock FYI, we are slowly working towards using Conda based...
[16:18:31] <wikibugs>	 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) @Ottomata just one more question for you!  >>! In T263496#6744142, @CDanis wrote: >>>! In T263496#6744057, @Ottomata wrote: >> The long term solution here is still not cl...
[16:20:47] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:20:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1035.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[16:22:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 16 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:22:30] <wikibugs>	 (03CR) 10Esanders: [C: 03+1] Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński)
[16:23:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:18] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10MoritzMuehlenhoff) @ChristineDeKock : I've added you to the cn=nda LDAP group, can you please retry?
[16:25:29] <wikibugs>	 (03PS1) 10Urbanecm: Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646)
[16:32:54] <moritzm>	 !log installing intel-microcode security updates on buster
[16:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[16:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: add ulogd ecs filter + tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi)
[16:40:23] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Nice - looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/664602 (owner: 10Ahmon Dancy)
[16:40:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikilabels: Remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/664752 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[16:41:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:22] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[16:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:49] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[16:44:05] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851
[16:44:51] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[16:45:37] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[16:46:20] <wikibugs>	 (03PS1) 10JMeybohm: envoy: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254)
[16:46:21] <godog>	 !log roll-restart logstash to apply ulogd filter - T234565
[16:46:24] <wikibugs>	 (03PS1) 10JMeybohm: envoy-future: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664854 (https://phabricator.wikimedia.org/T274254)
[16:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:27] <stashbot>	 T234565: Standardize the logging format - https://phabricator.wikimedia.org/T234565
[16:46:28] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[16:47:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851 (owner: 10Alexandros Kosiaris)
[16:47:53] <wikibugs>	 (03PS1) 10RobH: fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666)
[16:48:18] <wikibugs>	 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10CDanis)
[16:49:05] <wikibugs>	 (03Merged) 10jenkins-bot: calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851 (owner: 10Alexandros Kosiaris)
[16:49:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10MattCleinman) Thanks! @Tnegrin is currently my manager. Will make sure he approves this ticket.
[16:50:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) Thank you, it works! I now have access with username christinedk + my wikitech password, using the Newpyter instructions.
[16:51:01] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[16:52:08] <wikibugs>	 (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757)
[16:52:11] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[16:52:54] <wikibugs>	 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10Tnegrin) approved
[16:53:58] <wikibugs>	 (03CR) 10JMeybohm: "I tested this locally by now and it seems to work fine." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[16:54:04] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Add "minimum hits" support to logspam/logspam-watch [puppet] - 10https://gerrit.wikimedia.org/r/664602 (owner: 10Ahmon Dancy)
[16:54:52] <wikibugs>	 (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571)
[16:55:23] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[16:56:58] <wikibugs>	 (03CR) 10Volans: "Replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro)
[16:56:58] <godog>	 the logstash alerts are me, should recover shortly
[16:57:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:57:22] <godog>	 or not
[16:57:22] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[16:57:32] <icinga-wm>	 PROBLEM - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 #page on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:57:40] <godog>	 sorry that's going to page I think
[16:57:43] <godog>	 yes :(
[16:57:52] <legoktm>	 Hi
[16:57:57] <godog>	 my bad
[16:57:58] <apergos>	 indeed
[16:58:09] <robh>	 whew i was like
[16:58:10] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[16:58:13] <godog>	 I'll revert
[16:58:13] * volans ignoring as stated above
[16:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:16] <robh>	 'I'm installing new logstash, what did I do wrong'
[16:58:35] <shdubsh>	 godog: I'll prep a patch
[16:58:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "logstash: add ulogd ecs filter + tests" [puppet] - 10https://gerrit.wikimedia.org/r/664661
[16:58:50] <wikibugs>	 (03PS2) 10RobH: fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666)
[16:58:51] <godog>	 shdubsh: for the revert or the fix ?
[16:58:58] <shdubsh>	 godog: fix
[16:59:04] * robh isn't merging that until the outage is over, no worries
[16:59:12] <robh>	 I didn't mean to click rebase =P
[16:59:17] <herron>	 am about to get on a call, but please ping me if you need an extra set of hands
[16:59:19] <godog>	 robh: go ahead if you wish, that's fine
[16:59:31] <robh>	 oh, I just didn't want to add to the noise, my change is unrelated
[16:59:35] <robh>	 apologies!
[16:59:44] <godog>	 shdubsh: ack, thank you
[17:00:23] <wikibugs>	 (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - 10https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757)
[17:00:38] <wikibugs>	 (03CR) 10RobH: [C: 03+2] fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666) (owner: 10RobH)
[17:00:46] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878
[17:00:49] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.4167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:01:09] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[17:02:45] <wikibugs>	 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Ottomata) Let's do the former, I think doing the latter (using schemas to configure included data) is going to be the right solution after all.  So, special case these headers ju...
[17:02:46] <wikibugs>	 (03CR) 10Elukey: sre.hosts.decommission: move to class API (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[17:03:20] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE
[17:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:37] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: fix the centralauth management bit of the view scripts [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523)
[17:04:12] <wikibugs>	 (03PS1) 10Cwhite: profile: disable ulogd_ecs filter on legacy logstash [puppet] - 10https://gerrit.wikimedia.org/r/664861
[17:04:50] <wikibugs>	 (03CR) 10Bstorm: "As we wind down the old replicas, we are going to want to condense some of the filtering steps in this." [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm)
[17:05:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE
[17:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:00] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[17:06:09] <wikibugs>	 (03CR) 10Bstorm: "This is already tested via livehack, so I'll merge as soon as the jenkins job finishes." [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm)
[17:06:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28114/console" [puppet] - 10https://gerrit.wikimedia.org/r/664861 (owner: 10Cwhite)
[17:07:11] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:07:16] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: disable ulogd_ecs filter on legacy logstash [puppet] - 10https://gerrit.wikimedia.org/r/664861 (owner: 10Cwhite)
[17:07:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1035.eqiad.wmnet'] `  Of which those **FAILED**: ` ['logstash1035.eqiad.wmnet'] `
[17:07:44] <godog>	 shdubsh: thank you, I'll merge etc
[17:07:45] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:07:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:58] <rzl>	 o11y folks: it looks like the VO incident didn't auto-resolve again, do y'all have a phab task tracking that? I see T264016 T266570 T263423 for individual cases, but not sure if there's anything for the overall issue
[17:07:59] <stashbot>	 T266570: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570
[17:07:59] <stashbot>	 T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016
[17:07:59] <godog>	 oh nevermind, I see you did that already
[17:07:59] <stashbot>	 T263423: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423
[17:08:01] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[17:08:16] <godog>	 shdubsh: are you running puppet as well ?
[17:08:18] <shdubsh>	 heh, sorry.  was quick on the button
[17:08:27] <shdubsh>	 yeah
[17:09:11] <godog>	 rzl: there isn't an overall tracking task afaik no, which incident # tho ?
[17:09:28] <godog>	 shdubsh: thanks!
[17:09:47] <rzl>	 godog: https://portal.victorops.com/ui/wikimedia/incident/809/details from just now
[17:10:17] <godog>	 rzl: the alert didn't recover yet
[17:10:18] <rzl>	 oh! wait sorry, I saw a recovery and thought it was the page
[17:10:19] <rzl>	 never mind :)
[17:10:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE
[17:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1035.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[17:11:31] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[17:12:24] <wikibugs>	 (03CR) 10Andrew Bogott: "ok, so catching up... given that everything is already sent to syslog for all DB servers (Galera and otherwise), this change will definite" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[17:12:28] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Openstack control node galera: send mariadb logs to central logging [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[17:12:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE
[17:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE
[17:13:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:07] <wikibugs>	 (03CR) 10Hnowlan: start using imposm as OSM sync tool (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:13:34] <wikibugs>	 (03PS35) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:13:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[17:13:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE
[17:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:22] <wikibugs>	 (03CR) 10David Caro: utils: add script to run docker ci tests locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro)
[17:14:38] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-da
[17:14:38] <icinga-wm>	 ar-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:14:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE
[17:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:23] <wikibugs>	 (03PS1) 10Cwhite: profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862
[17:15:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite)
[17:15:58] <wikibugs>	 (03PS2) 10Cwhite: profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862
[17:16:24] <wikibugs>	 (03CR) 10Bstorm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm)
[17:16:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Deployment access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T275021 (10nnikkhoui) Thank you @MoritzMuehlenhoff ! And thank you @ArielGlenn for requesting :)
[17:16:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:16:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE
[17:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:09] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[17:17:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite)
[17:18:20] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite)
[17:20:17] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix the centralauth management bit of the view scripts [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm)
[17:21:48] <wikibugs>	 (03CR) 10Hnowlan: "> Patch Set 1:" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan)
[17:22:11] <wikibugs>	 (03PS2) 10Hnowlan: Add simple blubber image [software/tegola] - 10https://gerrit.wikimedia.org/r/664564
[17:22:11] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB daniel_zahn need reboot but test ongoing? https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:22:11] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB daniel_zahn need reboot but test ongoing? https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:22:23] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[17:25:06] <wikibugs>	 (03PS1) 10Cwhite: profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863
[17:25:28] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott)
[17:25:39] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.825 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:26:31] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE
[17:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite)
[17:27:26] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite)
[17:27:32] <wikibugs>	 (03PS2) 10Cwhite: profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863
[17:27:36] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite)
[17:27:40] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm)
[17:28:31] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE
[17:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:36] <wikibugs>	 (03PS1) 10JMeybohm: prometheus-statsd-exporter: Run as nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664864 (https://phabricator.wikimedia.org/T274254)
[17:28:38] <wikibugs>	 (03PS1) 10JMeybohm: nutcracker: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664865 (https://phabricator.wikimedia.org/T274254)
[17:29:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:29:45] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[17:29:48] <icinga-wm>	 RECOVERY - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 #page on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 11514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:29:49] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[17:29:59] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[17:29:59] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[17:30:23] <rzl>	 VO auto-resolved after all \o/ sorry for doubting
[17:31:03] <godog>	 heheh that's fair
[17:31:37] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.887 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:31:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:34:15] <wikibugs>	 (03PS2) 10Elukey: sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925)
[17:35:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1035.eqiad.wmnet'] `  and were **ALL** successful.
[17:36:33] <godog>	 !log roll-restart logstash7 in codfw/eqiad to apply ulogd filters - T234565
[17:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:38] <stashbot>	 T234565: Standardize the logging format - https://phabricator.wikimedia.org/T234565
[17:36:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:38:08] <wikibugs>	 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff)
[17:38:46] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:39:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH)
[17:48:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830 (owner: 10Filippo Giunchedi)
[17:53:28] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] tegola: remove image in favour of blubber-built image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[17:54:11] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] role::maps: fix MOTD message [puppet] - 10https://gerrit.wikimedia.org/r/662659 (owner: 10Hnowlan)
[17:58:55] <wikibugs>	 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Ottomata) (Sorry, just edited ^, somehow a very important 'not' did not make it through my typing fingers)
[18:00:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) Thanks @MoritzMuehlenhoff!   I have managed to set up the config and ssh in but I am not able to connect to JupyterLab (https://wikitech.wikimedia.org/wiki/...
[18:00:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:01:32] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:05:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Ottomata) Hi Pablo! As a WMF employee you should be added to the `wmf` LDAP group.  This should allow you to connect.
[18:06:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810 (owner: 10Volans)
[18:07:08] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn)
[18:07:46] <effie>	 !log disable puppet on mw* in eqiad 
[18:07:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:54] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1275.eqiad.wmnet'] `  and were **ALL** s...
[18:08:01] <wikibugs>	 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10bd808) This bug and {T274090} are both semi-innocuous in that neither is currently causing IABot or the wikis to break, bu...
[18:08:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1275.eqiad.wmnet
[18:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:31] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1344.eqiad.wmnet'] `  and were **ALL** s...
[18:08:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1344.eqiad.wmnet
[18:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add php 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664884
[18:09:39] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1343.eqiad.wmnet'] `  and were **ALL** s...
[18:11:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1343.eqiad.wmnet
[18:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:18] <mutante>	 !log mw1350 - powercycled via mgmt
[18:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn)
[18:17:12] <wikibugs>	 (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571)
[18:17:46] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn)
[18:18:29] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1350.eqiad.wmnet'] `  and were **ALL** s...
[18:18:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - 10https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[18:19:29] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1350.eqiad.wmnet
[18:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:08] <wikibugs>	 (03CR) 10Volans: "reply inline" (1 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro)
[18:20:17] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855
[18:20:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - 10https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[18:20:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:22] <stashbot>	 T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855
[18:21:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - 10https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[18:21:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1275.eqiad.wmnet
[18:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1343.eqiad.wmnet
[18:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:47] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1344.eqiad.wmnet
[18:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:41] <effie>	 !log enable puppet on mw*
[18:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1350.eqiad.wmnet
[18:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:24] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kibana: set vega.enabled: false by default [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: 10Herron)
[18:34:29] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I have learned there is an Apereo CAS test instance on WMCS reachable at idp.wmcloud.org . Some infos at T274461#6835716" [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox)
[18:38:51] <wikibugs>	 (03CR) 10Bstorm: "Does that address your concerns Alex?" [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm)
[18:40:10] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855 (duration: 19m 54s)
[18:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:16] <stashbot>	 T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855
[18:47:27] <akosiaris>	 Pchelolo: \o/ !!! Thanks!
[18:47:44] <Pchelolo>	 akosiaris: sorry it took so long, I forgot about it
[18:47:58] <Pchelolo>	 I believe now your puppet patch is safe
[18:48:36] <akosiaris>	 Pchelolo: I've waited so long for undeploying the service that I did not even notice ;)
[18:48:50] <akosiaris>	 ok will proceed with the undeploy tomorrow EU morning then
[18:54:05] <mutante>	 congrats on killing graphoid
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1900).
[19:00:04] <jouncebot>	 DannyS712 and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:04] <jouncebot>	 marxarelli and twentyafterfour: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1900).
[19:00:12] <DannyS712>	 here
[19:00:24] <Urbanecm>	 i can deploy today
[19:00:32] <MatmaRex>	 i gave just no-op cleanup patches
[19:00:39] <MatmaRex>	 have*
[19:01:29] <Urbanecm>	 ack
[19:02:08] <Urbanecm>	 DannyS712: your patch is WIP
[19:03:14] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove uses of removed VisualEditor config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:03:17] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:03:19] <wikibugs>	 (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[19:03:22] <DannyS712>	 Urbanecm fixed
[19:03:28] <Urbanecm>	 thx
[19:04:12] <Urbanecm>	 MatmaRex: do you want me to pull them onto a mwdebug host, once they merge?
[19:04:13] <wikibugs>	 (03Merged) 10jenkins-bot: Remove uses of removed VisualEditor config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:04:35] <wikibugs>	 (03PS2) 10Urbanecm: Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:04:39] <MatmaRex>	 Urbanecm: there's nothing to test, unless i made a typo or something
[19:04:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:04:45] <Urbanecm>	 MatmaRex: ack
[19:05:34] <wikibugs>	 (03Merged) 10jenkins-bot: Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński)
[19:05:41] <Urbanecm>	 syncing CS.php
[19:05:57] <wikibugs>	 (03PS3) 10Urbanecm: Enable GlobalWatchlist extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[19:06:02] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable GlobalWatchlist extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[19:06:51] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 6ac78bd2aa601db537f821c89b447c04927af422: Remove uses of removed VisualEditor config variables (T273177; 1/2) (duration: 01m 14s)
[19:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:56] <stashbot>	 T273177: Remove unused config options VisualEditorNewAccountEnableProportion and VisualEditorAutoAccountEnable - https://phabricator.wikimedia.org/T273177
[19:07:21] <Urbanecm>	 and syncing IS.php
[19:07:25] <wikibugs>	 (03Merged) 10jenkins-bot: Enable GlobalWatchlist extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[19:07:28] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:08:06] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:08:21] <mutante>	 rzl: ^ I did not fix it but I am glad it is 
[19:08:24] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6ac78bd2aa601db537f821c89b447c04927af422: Remove uses of removed VisualEditor config variables (T273177; 2/2) (duration: 01m 07s)
[19:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:37] <Urbanecm>	 MatmaRex: deployed
[19:08:45] <MatmaRex>	 thanks
[19:08:52] <Urbanecm>	 mutante: it may be related to me doing deployments (which clears opcache in some cases iirc)
[19:09:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10nettrom_WMF) >>! In T258413#6837869, @CBogen wrote: >>>! In T258413#6837821, @Ottomata wrote: >> Hi @CBogen, do you need direct access to data in Hadoop and Hive,...
[19:09:22] <mutante>	 rzl: issues were found with mcrouter connecting to mcrouter proxies.. if they are mixed stretch-buster, but not for stretch-stretch or buster-buster, so the fix for that is upgrading all to buster 
[19:09:23] <Urbanecm>	 DannyS712: your patch is available at mwdebug1001, please test
[19:09:40] <mutante>	 Urbanecm: I think I was missing an extra reboot after reinstall maybe.. that would have cleared it
[19:09:54] <mutante>	 but then did not want to do that because others were working 
[19:10:03] <mutante>	 now it's just resolved anyways
[19:10:08] <Urbanecm>	 i see :)
[19:10:26] <Urbanecm>	 and thanks for the explanation in the list mutante 
[19:10:35] <Urbanecm>	 as long as scap updates the code there, I'm good :)
[19:11:32] <DannyS712>	 Urbanecm the special page loads, but "Skipped unresolvable module ext.globalwatchlist.specialglobalwatchlist" so it doesn't work. I'm guessing that's because the load.php request is going to a different server than mwdebug1001 - it works fine on testwiki
[19:11:54] <DannyS712>	 the settings page loads, but again without the styles module
[19:12:01] <Urbanecm>	 i would blame RL cache
[19:12:31] <Urbanecm>	 let's sync it
[19:12:33] <Majavah>	 why would load.php not go to mwdebug? it's from your browser after all
[19:12:41] <mutante>	 Urbanecm: ACK, yes, it should get scap updates:) in a couple weeks (?) I will ask you if we still need mwdebug1003
[19:12:51] <Urbanecm>	 sure :)
[19:13:29] <wikibugs>	 (03PS1) 10AndyRussG: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054)
[19:14:19] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 352dd72c28462755546ac36a017548a7f0925df0: Enable GlobalWatchlist extension on metawiki (T260862) (duration: 01m 07s)
[19:14:22] <Urbanecm>	 DannyS712: and...done!
[19:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:25] <stashbot>	 T260862: Deploy GlobalWatchlist extension to production (Meta only) - https://phabricator.wikimedia.org/T260862
[19:14:40] <Urbanecm>	 anything else?
[19:14:40] <DannyS712>	 it works!
[19:14:44] <Urbanecm>	 cool
[19:14:46] <DannyS712>	 thanks so much
[19:14:48] <Majavah>	 woo!
[19:16:49] <wikibugs>	 (03PS2) 10AndyRussG: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054)
[19:17:25] <wikibugs>	 (03CR) 10AndyRussG: [C: 04-1] Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: 10AndyRussG)
[19:17:28] <wikibugs>	 (03PS1) 10Urbanecm: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977)
[19:17:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1033.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[19:17:44] <wikibugs>	 (03PS2) 10Urbanecm: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977)
[19:17:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977) (owner: 10Urbanecm)
[19:18:48] <tabbycat>	 oh, global watchlist
[19:19:19] <wikibugs>	 (03PS1) 10Urbanecm: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976)
[19:19:21] <Urbanecm>	 hi tabbycat 
[19:19:30] <tabbycat>	 meow Urbanecm 
[19:19:58] <wikibugs>	 (03Merged) 10jenkins-bot: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977) (owner: 10Urbanecm)
[19:20:16] <icinga-wm>	 RECOVERY - dhclient process on sretest1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:20:33] <wikibugs>	 (03PS2) 10Urbanecm: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976)
[19:20:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] tlwikibooks: Add Wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976) (owner: 10Urbanecm)
[19:21:40] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a7eb726f01ab5332d8b8951fdd0fa0c5a9459d4c: tlwikibooks: Add WB as an alias to NS_PROJECT (T274977) (duration: 01m 09s)
[19:21:42] <wikibugs>	 (03Merged) 10jenkins-bot: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976) (owner: 10Urbanecm)
[19:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:48] <stashbot>	 T274977: Addition of the WB: namespace alias in Tagalog Wikibooks - https://phabricator.wikimedia.org/T274977
[19:22:22] <rzl>	 mutante: ahh nod
[19:24:20] <Urbanecm>	 !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=tlwikibooks  --fix # T274977 # P14403
[19:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:58] <wikibugs>	 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) Do these machines just have to talk to each other (on what port/protocol btw?) or does it _really_ require that they are in wikimedia.org directly exposed to the Internet and without any cachi...
[19:26:27] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c37fa0115113fb31cb54d9cf3f18a13f656c73dd: tlwikibooks: Add Wikijunior namespace (T274976) (duration: 01m 09s)
[19:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:31] <stashbot>	 T274976: Addition of the Wikijunior: namespace in Tagalog Wikibooks - https://phabricator.wikimedia.org/T274976
[19:26:54] <legoktm>	 DannyS712: wheee congrats!!
[19:27:21] <legoktm>	  11:11:32 <DannyS712> Urbanecm the special page loads, but "Skipped unresolvable module ext.globalwatchlist.specialglobalwatchlist" so it doesn't work. I'm guessing that's because the load.php request is going to a different server than mwdebug1001 - it works fine on testwiki <-- shouldn't be possible, X-Wikimedia-Debug will route all requests, including load.php to the mwdebug server
[19:27:36] <Urbanecm>	 !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=tlwikibooks --fix # T274976 # P14404
[19:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:49] <Urbanecm>	 indeed, it worked for me when i tried it myself shortly before syncing
[19:27:53] <wikibugs>	 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) Machines that are directly exposed to the Internet and are managed manually are more of a challenge to security practices than internal machines I would think.
[19:27:54] <Urbanecm>	 (incl. style module)
[19:28:56] <Urbanecm>	 i think RL just needed some time to notice the new module
[19:32:22] <wikibugs>	 (03PS1) 10Urbanecm: arbcom_ruwiki: Add arbcom user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844)
[19:32:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] arbcom_ruwiki: Add arbcom user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844) (owner: 10Urbanecm)
[19:32:43] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1033.eqiad.wmnet with reason: REIMAGE
[19:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:38] <wikibugs>	 (03Merged) 10jenkins-bot: arbcom_ruwiki: Add arbcom user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844) (owner: 10Urbanecm)
[19:33:47] * legoktm nods
[19:34:48] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1033.eqiad.wmnet with reason: REIMAGE
[19:34:50] <Urbanecm>	 &debug=1 probably would've solved it
[19:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:09] <wikibugs>	 (03PS1) 10Urbanecm: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796)
[19:36:15] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6c5c5f0d1b83a7f05272f133c269c740af8352db: arbcom_ruwiki: Add arbcom user group (T274844) (duration: 01m 12s)
[19:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:20] <stashbot>	 T274844: arbcom-ru.wikipedia.org: add rights to bureaucrats usergroup - https://phabricator.wikimedia.org/T274844
[19:36:40] <wikibugs>	 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) regarding the request for 24GB of RAM:  This would make these the VMs with the most memory globally... more than _anything_ else.  To give you an idea .. all existing ganeti VMs are between 1...
[19:37:48] <wikibugs>	 (03PS1) 10Urbanecm: hewikisource: Allow reviewers to rollback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796)
[19:38:43] <wikibugs>	 (03PS2) 10Urbanecm: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796)
[19:38:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796) (owner: 10Urbanecm)
[19:39:56] <wikibugs>	 (03Merged) 10jenkins-bot: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796) (owner: 10Urbanecm)
[19:41:34] <wikibugs>	 (03PS2) 10Urbanecm: hewikisource: Allow reviewers to rollback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796)
[19:41:38] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] hewikisource: Allow reviewers to rollback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796) (owner: 10Urbanecm)
[19:41:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1033.eqiad.wmnet'] `  and were **ALL** successful.
[19:42:27] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 88e6ebc5565a7a0b1431dd5f52c701d8df641990: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import (T274796) (duration: 01m 09s)
[19:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:32] <stashbot>	 T274796: Several permission changes for he.wikisource - https://phabricator.wikimedia.org/T274796
[19:42:44] <wikibugs>	 (03Merged) 10jenkins-bot: hewikisource: Allow reviewers to rollback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796) (owner: 10Urbanecm)
[19:44:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[19:44:42] <wikibugs>	 (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757)
[19:45:31] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[19:45:46] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2e521f76c195ab50ab28a7d4812a35ceac246907: hewikisource: Allow reviewers to rollback (T274796) (duration: 01m 10s)
[19:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH)
[19:49:04] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:49:11] * Urbanecm done
[19:49:46] <wikibugs>	 (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:49:52] <wikibugs>	 (PS2) Dzahn: mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757)
[19:54:50] <wikibugs>	 (CR) Gehel: [C: -1] "one last comment inline (well, a few, but only the one about default attribute is really important)" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[19:56:26] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:57:39] <wikibugs>	 (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:57:46] <wikibugs>	 (PS3) Dzahn: mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[19:58:15] <wikibugs>	 SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Urbanecm) Why does a testing service need to be in production? Stuff in production realm should have production-level stability, and not be used for testing. Can you use a cloud-provided VM instead?
[20:00:04] <jouncebot>	 marxarelli and twentyafterfour: Dear deployers, time to do the Mediawiki train - American Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T2000).
[20:05:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (Ottomata) Ok, so just `wmf` LDAP and `analytics-privatedata-users` posix membership is needed.  Thank you.
[20:06:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (Abit) > In T258413#6836429, @MoritzMuehlenhoff wrote: > @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you to a...
[20:07:30] <wikibugs>	 (PS1) Dzahn: mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288 [puppet] - https://gerrit.wikimedia.org/r/664898 (https://phabricator.wikimedia.org/T245757)
[20:12:13] <wikibugs>	 (PS1) Dduvall: group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899
[20:12:15] <wikibugs>	 (CR) Dduvall: [C: +2] group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899 (owner: Dduvall)
[20:13:07] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899 (owner: Dduvall)
[20:14:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1034.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[20:15:53] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.31
[20:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:09] <logmsgbot>	 !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.31 (duration: 01m 15s)
[20:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:18] <wikibugs>	 (PS1) Dzahn: admin: create new group for gitlab-roots [puppet] - https://gerrit.wikimedia.org/r/664902 (https://phabricator.wikimedia.org/T274953)
[20:22:48] <marxarelli>	 wmf.31 seems pretty quiet
[20:23:05] <marxarelli>	 !log 1.36.0-wmf.31 rolled to group1. no new errors for wmf.31 (T271345)
[20:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:11] <stashbot>	 T271345: 1.36.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T271345
[20:24:35] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (Papaul)
[20:27:15] <wikibugs>	 (PS1) Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075)
[20:28:40] <wikibugs>	 (PS1) Dzahn: create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458)
[20:29:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
[20:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:45] <wikibugs>	 (CR) Dzahn: "The admin group this uses would be created in https://gerrit.wikimedia.org/r/c/operations/puppet/+/664902" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:30:11] <wikibugs>	 (CR) Dzahn: "the placeholder role this needs even if nothing else is puppetized .. would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/664904" [puppet] - https://gerrit.wikimedia.org/r/664902 (https://phabricator.wikimedia.org/T274953) (owner: Dzahn)
[20:30:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:31:43] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
[20:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:54] <wikibugs>	 (PS2) Dzahn: create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458)
[20:38:18] <wikibugs>	 (PS4) Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[20:39:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1034.eqiad.wmnet'] `  and were **ALL** successful.
[20:39:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[20:40:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH)
[20:41:46] <wikibugs>	 SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Joe) >>! In T274459#6839105, @Urbanecm wrote: > Why does a testing service need to be in production? Stuff in production realm should have production-level stability, and not be used for testing. Can...
[20:45:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] ""These VMs will not be puppetized" needs a thorough discussion." [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:46:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH) Open→Resolved Ok, these are all setup and imaged, staged and ready for subteam takeover.
[20:54:15] <wikibugs>	 (CR) Wolfgang Kandek: "Puppetization is planned to happen once we have SREs hired and they take over from the contractors. We estimate 6 months before that can h" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[21:00:04] <jouncebot>	 chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T2100).
[21:06:38] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[21:06:43] <wikibugs>	 (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[21:06:50] <wikibugs>	 (PS2) Dzahn: mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757)
[21:22:00] <wikibugs>	 (PS1) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384)
[21:29:10] <wikibugs>	 (PS2) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384)
[21:42:57] <wikibugs>	 (PS1) Andrew Bogott: Horizon: ship error logs to ELK [puppet] - https://gerrit.wikimedia.org/r/664934 (https://phabricator.wikimedia.org/T268175)
[21:47:04] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[21:47:54] <wikibugs>	 (PS5) Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[21:47:57] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH) a:Jclark-ctr
[21:48:17] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[21:49:35] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon: ship error logs to ELK [puppet] - https://gerrit.wikimedia.org/r/664934 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[22:02:36] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:04:20] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:08:36] <wikibugs>	 (CR) Thcipriani: "> Patch Set 2: Code-Review-2" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[22:11:32] <wikibugs>	 (CR) Razzi: [C: +1] "Looks good, could do a bit of code cleanup" (2 comments) [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384) (owner: Ottomata)
[22:27:17] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2035 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:27:29] <rzl>	 hi :(
[22:27:33] <legoktm>	 again
[22:27:42] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:27:59] <robh>	 here if needed but likely others better suited about.
[22:28:15] <wikibugs>	 (PS1) Bartosz Dziewoński: Make DiscussionTools' replytool available for everyone on gomwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/664940 (https://phabricator.wikimedia.org/T258554)
[22:28:38] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:28:43] <rzl>	 same pattern as yesterday except that we can certainly rule out the 304s as unrelated, if anyone was unsure
[22:29:10] <wikibugs>	 (PS1) Urbanecm: hewikisource: Allow sysops to grant/revoke reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/664941 (https://phabricator.wikimedia.org/T274796)
[22:29:34] <wikibugs>	 (CR) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. (2 comments) [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384) (owner: Ottomata)
[22:30:22] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:30:25] <cdanis>	 that is something 
[22:30:47] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6258 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
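The idle-worker alerts above print their thresholds inline: `0.2035 lt 0.3` for CRITICAL and `(C)0.3 lt (W)0.5 lt 0.6258` for OK. A minimal sketch of that comparison follows, assuming the value is an idle ratio where lower is worse. The function name `classify_idle_ratio` is hypothetical and this is not the actual Icinga check.

```python
def classify_idle_ratio(value, critical=0.3, warning=0.5):
    """Map an idle-worker ratio to an alert state; lower values are worse.

    The 0.3 and 0.5 defaults are the (C) and (W) thresholds printed in the
    alert text above. The function itself is an illustrative assumption.
    """
    if value < critical:
        return "CRITICAL"
    if value < warning:
        return "WARNING"
    return "OK"


# The two readings reported at 22:27:17 and 22:30:47.
for sample in (0.2035, 0.6258):
    print(sample, classify_idle_ratio(sample))
```

Both readings land where the alerts put them: the first below the critical threshold, the second above the warning threshold.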
[22:31:14] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:32:55] <legoktm>	 it's roughly the same time as yesterday too
[22:33:31] <rzl>	 yeah, about an hour later
[22:33:56] <wikibugs>	 (CR) Aklapper: "Thanks for the quick merge! Nah, no pasting of results needed." [puppet] - https://gerrit.wikimedia.org/r/664002 (https://phabricator.wikimedia.org/T274711) (owner: Aklapper)
[22:34:11] <wikibugs>	 (PS1) Andrew Bogott: lookup_table_output.json: Send horizon logs to kafka [puppet] - https://gerrit.wikimedia.org/r/664943 (https://phabricator.wikimedia.org/T268175)
[22:34:15] <RhinosF1>	 There are some replag alerts today, though, according to -databases, that I don't remember going off yesterday but that happened around the same time as the page
[22:35:30] <rzl>	 active worker count is still elevated, I wouldn't be surprised if this pages again, still digging
[22:36:23] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:36:30] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:36:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] lookup_table_output.json: Send horizon logs to kafka [puppet] - https://gerrit.wikimedia.org/r/664943 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[22:36:52] <wikibugs>	 SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani) This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks can use.  >>! In T274459#6839006, @Dzahn wrote: > regarding...
[22:38:08] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:38:15] <wikibugs>	 SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani)
[22:38:16] <legoktm>	 rzl: what are you looking through?
[22:38:53] <rzl>	 legoktm: atm just dashboards -- sshing to an api server now to poke through logs and see if anything stands out
[22:39:16] <tabbycat>	 legoktm: if this is always happening around the same time, I'd see if there's some maintenance.pp cron behind it :)
[22:39:25] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2604 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:40:44] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:41:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:41:59] <legoktm>	 tabbycat: I don't see anything active right now
[22:42:03] <legoktm>	 just background parser cache purge
[22:42:22] <legoktm>	 and extensions/MediaModeration/maintenance/ModerateExistingFiles.php
[22:42:51] <tabbycat>	 MediaModeration, that's new to me
[22:42:59] <RhinosF1>	 legoktm: when did that start
[22:43:16] <legoktm>	 when did what start?
[22:43:24] <RhinosF1>	 Because swift did come up yesterday iirc and swift = files in the most basic sense
[22:43:30] <RhinosF1>	 legoktm: the moderate files script
[22:44:42] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 29 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:45:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:45:58] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:46:11] <rzl>	 legoktm: everything in the slowlog looks like it's hanging in DB calls, and s1 open connections are spiking, same as yesterday https://grafana.wikimedia.org/goto/BYO6_XPGz
[22:49:37] <legoktm>	 can we look at the queries themselves?
[22:49:51] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5424 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:50:03] <rzl>	 a DBA would know how :P I'm trying to avoid having to wake one up, but will if I can't figure this out before too long
[22:50:15] <legoktm>	 https://tendril.wikimedia.org/activity?research=0&labsusers=0
[22:50:32] <rzl>	 ah cheers
[22:51:23] <legoktm>	 is it okay if we just kill those queries?
[22:51:36] <legoktm>	 they're stacking up too
[22:51:58] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:52:31] <cdanis>	 legoktm: sounds good to me
[22:52:54] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:53:19] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1208 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:53:24] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1308 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:56:00] <Bsadowski1>	 CSS is loading slowly
[22:56:25] <Urbanecm>	 legoktm: rzl: do we know which queries cause this? We can try to disable the feature that generates them if needed
[22:56:35] <RhinosF1>	 Bsadowski1: I'd guess that's expected
[22:56:36] <Bsadowski1>	 "wgBackendResponseTime":520" :O
[22:56:38] <legoktm>	 uh, I'm not actually sure how to kill a query
[22:56:42] <legoktm>	 Bsadowski1: known
[22:56:47] <Bsadowski1>	 ""wgHostname":"mw1370"}"
[22:56:48] <Bsadowski1>	 k
[22:57:04] <Urbanecm>	 legoktm: mind me updating the status? Or should I wait?
[22:57:12] <legoktm>	 please
[22:57:19] <RhinosF1>	 Bsadowski1: you can see ongoing pages on https://klaxon.wikimedia.org/ under recent.
[22:57:44] <Urbanecm>	 legoktm: does this work?
[22:57:49] <rzl>	 Urbanecm: lgtm, thanks
[22:57:53] <Urbanecm>	 np
[23:00:13] <mutante>	 there used to be https://wikitech.wikimedia.org/wiki/Query_killer
[23:00:23] <mutante>	 that kills queries over 60 s?
[23:00:27] <Urbanecm>	 legoktm: there are some docs about killing at https://wikitech.wikimedia.org/wiki/MariaDB#Long_running_queries
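The idea discussed above, killing queries that have run longer than about 60 s, can be sketched as follows. This is an illustration only, not the Query_killer tool or the procedure from the linked docs. It assumes rows shaped like MariaDB `SHOW PROCESSLIST` output (`Id`, `Command`, `Time`, `Info`). The function name `kill_statements` and the sample rows are made up, and the code only builds `KILL QUERY` statements without connecting to any database.

```python
from typing import Dict, List


def kill_statements(processlist: List[Dict], max_seconds: int = 60) -> List[str]:
    """Build KILL QUERY statements for queries running longer than max_seconds.

    Rows are assumed to mirror SHOW PROCESSLIST columns. Idle connections
    (Command == "Sleep") are skipped because they are not running a query.
    """
    statements = []
    for row in processlist:
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {row['Id']};")
    return statements


# Hypothetical sample rows, not taken from the incident.
sample = [
    {"Id": 101, "Command": "Query", "Time": 75, "Info": "SELECT ..."},
    {"Id": 102, "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 103, "Command": "Query", "Time": 5, "Info": "SELECT ..."},
]
print(kill_statements(sample))
```

Only the first sample row qualifies: the second is an idle connection and the third is under the limit. Generating the statements for review, rather than executing them, leaves the final decision with whoever is handling the incident.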
[23:03:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[23:03:47] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2781 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[23:04:07] <legoktm>	 Urbanecm: ty figured it out
[23:04:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:04:19] <Urbanecm>	 great
[23:08:36] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[23:09:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:09:31] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[23:15:59] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7894 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[23:16:24] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:21:16] <rzl>	 we're still root-causing this in another channel, but we believe everything is currently mitigated -- please yell if you're still experiencing slowness :)
[23:25:12] <RhinosF1>	 rzl: seems ok. I'll shout if i hear anything. Thanks for the work.
[23:25:30] <RhinosF1>	 is there a task i can read in the morning if you do find out the cause
[23:25:39] * RhinosF1 likes to be nosey
[23:25:47] <wikibugs>	 SRE, InternetArchiveBot, Traffic, Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (bd808) >>! In T269914#6835578, @Legoktm wrote: > We saw a bunch of these requests again today. The main problem is that ma...
[23:30:26] <rzl>	 RhinosF1: it may not be public right away but we'll share as much as we can :)
[23:31:12] <wikibugs>	 SRE: sessionstore SSL cert CRIT in Icinga since > 6 days - https://phabricator.wikimedia.org/T275090 (Dzahn)
[23:31:52] <RhinosF1>	 rzl: ack ty, if there's a task number i'll bookmark it