[00:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T0000).
[00:00:05] legoktm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:19] (PS1) Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (try 2) [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521)
[00:00:23] guess it's just me
[00:00:50] (CR) Legoktm: [C: +2] Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/664649 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:00:53] (CR) Legoktm: [C: +2] Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/664650 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:01:01] (CR) Legoktm: [C: +2] Set $wgTimelineFontDirectory [mediawiki-config] - https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:02:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
[00:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:45] (Merged) jenkins-bot: Set $wgTimelineFontDirectory [mediawiki-config] - https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:04:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
[00:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:52] (Merged) jenkins-bot: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/664649 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:07:31] (Merged) jenkins-bot: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/664650 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:09:41] https://test.wikipedia.org/w/index.php?title=EasyTimeline&type=revision&diff=466643&oldid=466221
[00:09:45] it works :)
[00:11:46] and on wmf.30 too, just tested in preview though
[00:13:32] !log legoktm@deploy1001 Synchronized wmf-config/timeline.php: Set $wgTimelineFontDirectory (T274822) (duration: 01m 05s)
[00:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:40] T274822: [EasyTimeline] No text included / no font rendered / displayed at all in PNG graph output - https://phabricator.wikimedia.org/T274822
[00:15:47] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 02s)
[00:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:36] RECOVERY - Long running screen/tmux on centrallog1001 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:17:19] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 06s)
[00:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:58] PROBLEM - Host cloudnet1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
[00:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:26] RECOVERY - Host cloudnet1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[00:31:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:56] !log mw1351 - powercycled
[00:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:50] (CR) Legoktm: "I explained how I tested this at T273521#6835886." [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[00:39:50] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[00:49:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1351.eqiad.wmnet
[00:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:02] RECOVERY - Long running screen/tmux on mwdebug1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:56:28] (PS1) Legoktm: mediawiki: Remove hhvm reference in mw-cgroup unit [puppet] - https://gerrit.wikimedia.org/r/664688
[00:57:46] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1345.eqiad.wmnet'] ` an...
[00:58:35] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1345.eqiad.wmnet
[00:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
[01:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
[01:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet
[01:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1345.eqiad.wmnet
[01:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:16] (PS1) Dzahn: mcrouter_wancache: move mcrouter proxy from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[01:19:50] (PS2) Dzahn: mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[01:23:55] (PS1) Dzahn: mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757)
[01:26:50] yes
[01:27:13] (PS1) Dzahn: mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757)
[01:37:29] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[01:39:17] !log mwdebug1001 - rebooting
[01:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:15] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[01:41:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
[01:41:53] !log mwdebug1001 - back on buster and pooled
[01:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:33] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[02:56:21] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:43] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:57] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 5.509 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:29:01] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:15] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:54:13] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:12:07] (CR) Legoktm: arclamp: add excimer-real pipeline (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) (owner: Dave Pifke)
[04:38:39] PROBLEM - Long running screen/tmux on centrallog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 12365, 7305491s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[04:43:47] SRE, ops-eqiad, cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (Papaul) @Jclark-ctr no good news, you will have to try to use a DVD. Thanks
[04:50:38] SRE, serviceops, Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (Krinkle)
[04:52:51] SRE, Traffic, serviceops, Sustainability (Incident Followup), Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Krinkle)
[04:59:19] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 385054056912 and 446944 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:11:48] SRE, serviceops, SRE-OnFire-Incident-Docs, Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (Krinkle) >>! In T272215#6755992, @jcrespo wrote: > More details are yet to be provided on the Incident report, I can help with that once the right...
[05:14:44] SRE, serviceops, SRE-OnFire-Incident-Docs, Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (Krinkle)
[06:10:02] In around 50 minutes I will be restarting x1 master (daemon restart)
[06:18:43] (PS1) Marostegui: db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361)
[06:20:08] (CR) Marostegui: [C: +2] db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:24:22] (CR) Giuseppe Lavagetto: [C: -1] "I think this patch does what we want, but I would love to generalize what you did a bit (in the nginx configuration). That should be done " [puppet] - https://gerrit.wikimedia.org/r/664683 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[06:32:15] (CR) Marostegui: "This was enabling notifications" [puppet] - https://gerrit.wikimedia.org/r/664710 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:35:29] (PS1) Marostegui: instances.yaml: Add db1172 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664723 (https://phabricator.wikimedia.org/T258361)
[06:36:31] (CR) Marostegui: [C: +2] instances.yaml: Add db1172 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664723 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1172 to dbctl, but not pooled yet T258361', diff saved to https://phabricator.wikimedia.org/P14385 and previous config saved to /var/cache/conftool/dbconfig/20210217-063915-marostegui.json
[06:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:22] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:52:05] In around 10 minutes I will be restarting x1 master (daemon restart)
[07:00:04] !log Restart db1103 (x1) primary master - T273758
[07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:09] T273758: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758
[07:01:26] !log Restart db1103 (x1) primary master DONE - T273758
[07:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:48] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:04:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
[07:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:06] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 19 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:06:06] SRE, Cognate, ContentTranslation, DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (Marostegui) This was done: Master down: 07:00:09 Master up: 07:01:24
[07:06:44] SRE, Cognate, ContentTranslation, DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (Marostegui)
[07:07:00] marostegui: I was about to ask, anything happening to db1103.eqiad.wmnet ? :D
[07:07:11] yep, the restart :)
[07:07:21] yes yes all good, thanks :)
[07:07:29] thanks :**
[07:13:14] SRE, Cognate, ContentTranslation, DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (Marostegui) Closing this as fixed: ` # mysql -e "select @@report_host" +--------------------+ | @@report_host | +--------------------+ | db1103.eqiad.wmnet |...
[07:13:23] SRE, Cognate, ContentTranslation, DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (Marostegui) Open→Resolved
[07:16:27] !log Add x1 to orchestrator
[07:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
[07:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1172 in s8 for the first time - T258361', diff saved to https://phabricator.wikimedia.org/P14386 and previous config saved to /var/cache/conftool/dbconfig/20210217-072131-marostegui.json
[07:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:37] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:22:22] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet
[07:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:14] (PS10) ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: Cparle)
[07:30:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet
[07:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet
[07:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet
[07:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14387 and previous config saved to /var/cache/conftool/dbconfig/20210217-074107-marostegui.json
[07:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:12] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:43:15] (PS1) Ladsgroup: wikilabels: Remove absented cron [puppet] - https://gerrit.wikimedia.org/r/664752 (https://phabricator.wikimedia.org/T273673)
[07:46:58] (CR) Giuseppe Lavagetto: [C: -1] "The code looks ok, but I don't see an additional checking that the new metrics act like expected. You should add at least one test in medi" [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) (owner: Hnowlan)
[07:48:20] (CR) Giuseppe Lavagetto: [C: +1] "Before merging I'd check that the service can be restarted with no harm caused to live requests." [puppet] - https://gerrit.wikimedia.org/r/664688 (owner: Legoktm)
[07:59:15] (CR) Filippo Giunchedi: [C: +1] kibana: set vega.enabled: false by default [puppet] - https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: Herron)
[07:59:33] (CR) Filippo Giunchedi: [C: +2] grafana: update home dashboard to Grafana 7.4 [puppet] - https://gerrit.wikimedia.org/r/664555 (https://phabricator.wikimedia.org/T263747) (owner: Filippo Giunchedi)
[08:00:53] (CR) Giuseppe Lavagetto: [C: -1] "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: Jcrespo)
[08:02:34] (PS1) Muehlenhoff: Add Georgina Burnett to wmde LDAP group [puppet] - https://gerrit.wikimedia.org/r/664772 (https://phabricator.wikimedia.org/T273780)
[08:04:01] SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) >>! In T274488#6835686, @Jclark-ctr wrote: > @fgiunchedi would you be ok with chassis swap using ms-be1018 recently decommissioned? Yes, please proceed
[08:04:27] (CR) Muehlenhoff: [C: +2] Add Georgina Burnett to wmde LDAP group [puppet] - https://gerrit.wikimedia.org/r/664772 (https://phabricator.wikimedia.org/T273780) (owner: Muehlenhoff)
[08:04:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
[08:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:13] SRE, ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (fgiunchedi) Open→Resolved LGTM, thank you @Papaul
[08:06:22] SRE, LDAP-Access-Requests, Patch-For-Review: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (MoritzMuehlenhoff) Open→Resolved @georginaburnett-wmde : Your access has been enabled. I'm closing the task, please reopen if you run into any issues!
[08:07:16] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
[08:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:59] SRE, SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (MoritzMuehlenhoff) Ack, this needs approval/discussion in the next SRE meeting since it would create a new access group.
[08:11:02] SRE, SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (MoritzMuehlenhoff) p:Triage→Medium
[08:12:06] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[08:13:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[08:16:12] SRE, SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (MoritzMuehlenhoff) @MattCleinman : Hi, in this case for access to Superset we don't need your SSH key (but in fact you need to be added to analytics-privatedata-users). This needs approval from the f...
[08:16:19] SRE, SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (MoritzMuehlenhoff) p:Triage→Medium
[08:23:16] (CR) Jcrespo: "Andrew, this is galera, which only cloud use, so I don't have a say on it." [puppet] - https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[08:27:26] (CR) Jcrespo: [C: -1] Openstack control node galera: send mariadb logs to central logging [puppet] - https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[08:27:48] (CR) Jcrespo: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: Jcrespo)
[08:37:38] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 117 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:37:52] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:40:03] SRE, serviceops, SRE-OnFire-Incident-Docs, Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (jcrespo) I personally don't feel capable neither to write proper docs, file follow ups nor to close it. When I said "more details are yet to be pr...
[08:40:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (MoritzMuehlenhoff) @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you to analytics-privatedata-users: * Your m...
[08:41:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14388 and previous config saved to /var/cache/conftool/dbconfig/20210217-084120-marostegui.json
[08:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:27] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:44:09] !log upgrade es2020 es2021 es2022's kernel
[08:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:39] SRE, serviceops, SRE-OnFire-Incident-Docs, Sustainability (Incident Followup): High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe) >>! In T272215#6836259, @Krinkle wrote: >>>! In T272215#6755992, @jcrespo wrote: >> More details are yet to be provided on the Incident repor...
[08:49:08] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 8 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:49:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:14] (CR) Kosta Harlan: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[08:54:18] (Abandoned) Kosta Harlan: linkrecommendation: Disable cron job on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[08:56:27] SRE, GitLab, SRE-Access-Requests: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (Aklapper)
[08:59:19] (PS1) Marostegui: wmnet: Failover m1-master to dbproxy1012 [dns] - https://gerrit.wikimedia.org/r/664774
[09:05:30] (CR) Muehlenhoff: [C: +2] Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - https://gerrit.wikimedia.org/r/662918 (owner: Muehlenhoff)
[09:10:23] (CR) Giuseppe Lavagetto: [C: +1] mtail: add exception handling in tests for non-Debian OSes [puppet] - https://gerrit.wikimedia.org/r/663860 (owner: Hnowlan)
[09:23:53] (PS1) Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315)
[09:24:20] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:11] ops-eqiad, Analytics: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (elukey)
[09:27:50] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:47] !log reboot dbstore100[3-5] for kernel upgrades
[09:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:49] (stop replicas; stop mariadb instances; umount /srv; reboot; etc..)
[09:34:26] elukey: do `swapoff -a` too, before reboot
[09:34:36] (if it's not already too late)
[09:35:46] kormat: too late for 1003 but I'll do it for the others thanks :)
[09:35:59] reclaiming swap during reboot is sometimes paaainfully slow. i've had db machines hang for 20-30 minutes when all i wanted was a nice quick reboot
[09:36:19] +1 right
[09:36:24] (CR) Alexandros Kosiaris: [C: +1] "> Patch Set 1:" [software/tegola] - https://gerrit.wikimedia.org/r/664564 (owner: Hnowlan)
[09:37:59] (PS1) Kormat: WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/664779
[09:40:02] (CR) Kormat: [C: +1] wmnet: Failover m1-master to dbproxy1012 [dns] - https://gerrit.wikimedia.org/r/664774 (owner: Marostegui)
[09:41:11] (CR) Kormat: [C: +2] WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/664779 (owner: Kormat)
[09:42:04] (PS11) ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: Cparle)
[09:42:31] (CR) jerkins-bot: [V: -1] Generation of json dumps for wikimedia commons [puppet] - https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: Cparle)
[09:42:57] (PS1) Muehlenhoff: Initial stub role for rootless Cumin [puppet] - https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840)
[09:43:32] (Merged) jenkins-bot: WMFMariaDB: Display ip addresses properly. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/664779 (owner: Kormat)
[09:43:36] (PS1) Jbond: cloud idp: update mapped attributes [puppet] - https://gerrit.wikimedia.org/r/664781
[09:43:38] (CR) Marostegui: [C: +2] wmnet: Failover m1-master to dbproxy1012 [dns] - https://gerrit.wikimedia.org/r/664774 (owner: Marostegui)
[09:44:03] (CR) Volans: [C: +1] "If puppet compiler is happy go for it, lgtm." [puppet] - https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[09:44:30] (CR) Jbond: [C: +2] cloud idp: update mapped attributes [puppet] - https://gerrit.wikimedia.org/r/664781 (owner: Jbond)
[09:45:54] (PS2) Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315)
[09:53:05] (CR) Volans: [C: -1] "Unless I'm missing something I think there is a flaw in the current approach." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: CRusnov)
[09:55:18] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:55:43] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/28103/" [puppet] - https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[09:55:47] (CR) Muehlenhoff: [C: +2] Initial stub role for rootless Cumin [puppet] - https://gerrit.wikimedia.org/r/664780 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[09:56:34] (PS1) DCausse: [wdqs] disable fetching constraints [puppet] - https://gerrit.wikimedia.org/r/664782 (https://phabricator.wikimedia.org/T274982)
[09:59:02] (PS3) Jbond: P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782)
[09:59:38] (CR) Jbond: [C: +2] debug_host: calculate the correct realm [software/puppet-compiler] - https://gerrit.wikimedia.org/r/664598 (owner: Jbond)
[10:01:15] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[10:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:38] PROBLEM - MariaDB Replica IO: x1 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:01:58] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:02:02] PROBLEM - MariaDB read only x1 on dbstore1005 is CRITICAL: Could not connect to localhost:3320 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:14] PROBLEM - MariaDB Replica SQL: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:26] PROBLEM - MariaDB Replica IO: staging on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:32] PROBLEM - MariaDB read only s8 on dbstore1005 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:36] PROBLEM - MariaDB read only staging on dbstore1005 is CRITICAL: Could not connect to localhost:3350 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:40] PROBLEM - MariaDB Replica IO: s8 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:42] PROBLEM - MariaDB read only s6 on dbstore1005 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:02:46] PROBLEM - MariaDB Replica SQL: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:50] PROBLEM - MariaDB Replica IO: s6 on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:52] elukey: ^
[10:02:56] PROBLEM - MariaDB Replica SQL: s6 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:02:57] I guess that's you rebooting?
[10:02:58] PROBLEM - mysqld processes on dbstore1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:03:08] PROBLEM - MariaDB Replica SQL: x1 on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:03:18] marostegui: yep downtime expired probably
[10:03:47] sorry for the noise
[10:03:49] elukey: coool, nothing to worry about then :)
[10:04:04] RECOVERY - MariaDB Replica IO: staging on dbstore1005 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:04:05] elu downtimes-expire-so-quickly-why-even-create-them key
[10:04:18] RECOVERY - MariaDB read only staging on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 13s, read_only: False, event_scheduler: True, 15.61 QPS, connection latency: 0.002788s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:04:38] RECOVERY - mysqld processes on dbstore1005 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:05:01] kormat: I put half an hour for the three of them and of course I got distracted right after the last reboot :D
[10:05:10] hehe
[10:05:24] RECOVERY - MariaDB read only x1 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 98s, read_only: True, event_scheduler: True, 39.47 QPS, connection latency: 0.001820s, query latency: 0.000356s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:05:32] RECOVERY - MariaDB Replica SQL: staging on dbstore1005 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:05:43] marostegui: so reboots completed for dbstores :P
[10:05:50] \o/
[10:05:50] RECOVERY - MariaDB read only s8 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 92s, read_only: True, event_scheduler: True, 6680.22 QPS, connection latency: 0.005290s, query latency: 0.000674s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:05:52] thanks
[10:05:53] elukey: there's a terrible script on cumin1001, `/home/kormat/bin/reboot-host`
[10:05:58] it'll eventually become a cookbook
[10:06:00] RECOVERY - MariaDB Replica IO: s8 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:04] but might be worth looking at if you're doing much of this
[10:06:04] RECOVERY - MariaDB read only s6 on dbstore1005 is OK: Version 10.4.15-MariaDB, Uptime 87s, read_only: True, event_scheduler: True, 4015.24 QPS, connection latency: 0.003376s, query latency: 0.000280s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:06:06] RECOVERY - MariaDB Replica SQL: s8 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:10] RECOVERY - MariaDB Replica IO: s6 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:14] (PS3) Giuseppe Lavagetto: Add the add_user filter [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/664544
[10:06:16] RECOVERY - MariaDB Replica SQL: s6 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:18] kormat: I wanted to ask about it, definitely interested in working on a cookbook [10:06:28] RECOVERY - MariaDB Replica SQL: x1 on dbstore1005 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:36] RECOVERY - MariaDB Replica IO: x1 on dbstore1005 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:40] kormat: is it an attempt to nerd-snipe me I suppose :D [10:06:47] *it is [10:06:58] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:07:02] (03CR) 10Giuseppe Lavagetto: Add the add_user filter (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [10:07:11] (03PS12) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [10:07:14] I can fall into the trap if you review some reuse-partman recipe that I am going to send in a bit :D [10:07:39] elukey: i can neither confirm nor deny [10:08:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:01] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) 05Open→03Resolved a:03Joe Given the original issue is definitely resolved, there is no point in keeping this ta... 
[10:12:16] 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (10Joe) a:03Joe [10:12:54] <_joe_> jouncebot: next [10:12:55] In 1 hour(s) and 47 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1200) [10:13:22] <_joe_> !log depooling mw1331 to perform some tests for T266855 [10:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:29] T266855: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 [10:19:19] (03CR) 10Jbond: [C: 03+1] "see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:19:48] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782) (owner: 10Jbond) [10:21:51] kormat: re the reboot-host script you have, perhaps you could review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/657102 to ensure it's flexible enough to account for the use cases (cough volans) [10:23:23] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: set up conntrack sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/664785 (https://phabricator.wikimedia.org/T272963) [10:23:39] 10SRE, 10ops-eqiad, 10Analytics: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:23:47] 10SRE, 10Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10MoritzMuehlenhoff) p:05Triage→03High [10:24:20] 10SRE: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp 
X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:25:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: set up conntrack sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/664785 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [10:26:19] jbond42: the pre/post scripts: do they run on the cumin hosts, or on the target hosts? [10:26:32] (i need both) [10:27:21] i guess i could shell out to cumin from the cumin hosts; it doesn't feel very clean though [10:27:45] kormat: what are you trying to do? [10:27:53] * volans missing context [10:27:56] volans: amazing things [10:27:59] kormat: pre_scripts is a list of scripts that run on the host. pre_action is a function that by default calls self._run_scripts(self.pre_scripts, hosts) so you could hook that to do something on cumin first [10:28:00] ofc :D [10:28:06] volans: /home/kormat/bin/reboot_host on cumin1001 [10:28:35] it's reboot-host :D [10:28:47] congrats, you got past the first guardian! [10:29:03] to roommate agreement?™ [10:29:09] jbond42: can you pass parameters to the scripts run on the target host? [10:29:21] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [10:29:28] kormat: why isn't that a cookbook? [10:29:49] volans: it will be, eventually. currently too many pre-reqs aren't in place. [10:30:00] happy to chat/know what they are [10:30:21] <_joe_> high latency in codfw?? 
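A minimal sketch of the hook pattern described at 10:27:59, assuming a hypothetical base class. Only `pre_scripts`, `pre_action` and the `self._run_scripts(self.pre_scripts, hosts)` call come from the chat; `RebootBatchSketch`, `DbRebootSketch`, the `log` list, and the script path are made up for illustration, and nothing here is the real cookbook under review:

```python
class RebootBatchSketch:
    """Hypothetical stand-in for the cookbook base class discussed in the chat."""

    def __init__(self, pre_scripts=None):
        # Scripts meant to run on the target hosts before the reboot.
        self.pre_scripts = list(pre_scripts or [])
        # Records what would be executed; a real cookbook would run commands instead.
        self.log = []

    def _run_scripts(self, scripts, hosts):
        # Placeholder for remote execution on each target host.
        for host in hosts:
            for script in scripts:
                self.log.append(f"{host}: {script}")

    def pre_action(self, hosts):
        # Default behaviour as described in the chat: run pre_scripts on the hosts.
        self._run_scripts(self.pre_scripts, hosts)


class DbRebootSketch(RebootBatchSketch):
    """Hypothetical subclass that does something on the cumin host first."""

    def pre_action(self, hosts):
        # Step that runs locally (on the cumin host) before touching the targets.
        self.log.append("cumin: local pre-step")
        # Then fall through to the default per-host scripts.
        super().pre_action(hosts)


job = DbRebootSketch(pre_scripts=["/usr/local/bin/example-pre"])  # hypothetical path
job.pre_action(["db-host-1"])
print(job.log)
```

Overriding `pre_action` and calling `super()` keeps the default per-host behaviour while adding a local step, which would cover the "i need both" case without shelling out to cumin from the cumin host.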
[10:30:36] kormat: short answer is yes; depending on what exactly you need to do, you would need to hook pre_scripts or pre_actions [10:30:43] <_joe_> what happened in codfw at 9:50? [10:31:24] <_joe_> oh I see [10:31:32] <_joe_> a ton of requests to mwdebug2002 [10:32:33] (03CR) 10ArielGlenn: "PCC looks as I expect: https://puppet-compiler.wmflabs.org/compiler1002/28104/" [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [10:36:50] (03PS1) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788 [10:37:38] (03CR) 10Volans: [C: 04-1] "My first pass, sorry if I commented on something already discussed in the previous PSes." (0320 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:44:52] (03PS2) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788 [10:48:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw1331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 390 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:49:41] <_joe_> that is me [10:50:37] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: keepalived: use nopreempt option [puppet] - 10https://gerrit.wikimedia.org/r/664789 (https://phabricator.wikimedia.org/T272963) [10:51:40] (03PS3) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788 [10:52:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure [10:52:01] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure [10:52:06] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:24] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) This hasn't arrived yet, right? It would be useful to have one large capacity system for [[ https://phabricator.wikimedia.org/T267338 | next week's test ]], but this is unrelated... [10:52:39] (03CR) 10Volans: [C: 03+1] "Thanks John for the details, LGTM, there are just the two nit comments left, none is a blocker." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:53:02] (03PS4) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788 [10:54:04] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) Adding local dc ops on CC of this ticket- things would have to go really bad to needing him for this test (this should be a relatively boring proc... [10:54:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: keepalived: use nopreempt option [puppet] - 10https://gerrit.wikimedia.org/r/664789 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [10:55:34] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:56:20] 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (10Joe) First observation I can make is that most requests are done by the math extension, and usually go in pairs... 
[10:56:49] (03CR) 10Elukey: "Kormat: if you have time, here some examples of hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey) [11:02:30] (03CR) 10Volans: [C: 03+1] "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:03:54] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:07] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [11:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:24] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:34] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [11:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:35] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:46] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:08:47] RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:27] RECOVERY - cassandra CQL 10.64.32.8:9042 on maps1009 is OK: TCP OK - 0.013 second response time on 10.64.32.8 port 9042 https://phabricator.wikimedia.org/T93886 [11:11:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [11:17:05] (03CR) 10Jbond: install_server/dhcp: dhcpd.conf include mechanism support machinery (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:17:15] (03CR) 10Kormat: [C: 04-2] "I have comments." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey) [11:17:33] ^ for when -1 isn't strong enough [11:19:52] kormat: I was counting on it yes [11:19:58] :D [11:21:08] should have used lsblk, +1 [11:22:43] thanks a lot for the feedback, going to work on it and fill my ignorance gaps [11:22:51] yw <3 [11:24:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14389 and previous config saved to /var/cache/conftool/dbconfig/20210217-112422-marostegui.json [11:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:28] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [11:32:11] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [11:34:39] 10SRE, 10Traffic, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:34:57] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:35:50] (03PS1) 10Giuseppe Lavagetto: P:services_proxy::envoy: add keepalive to restbase-https [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) [11:38:04] (03CR) 10JMeybohm: [C: 03+2] admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [11:42:39] PROBLEM - PHP opcache health on mwdebug1003 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:45:34] (03PS1) 10Phuedx: Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798) [11:45:36] (03PS1) 10Phuedx: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) [11:49:51] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:50:03] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) [11:50:12] (03PS2) 10Giuseppe Lavagetto: Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) [11:55:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netbox-dev2001.wikimedia.org [11:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2001.wikimedia.org [11:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1200). [12:00:04] phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:29] o/ [12:01:16] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: start conntrackd before keepalived [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963) [12:01:49] i can deploy today [12:01:56] (03CR) 10Urbanecm: [C: 04-1] vector: Enable search treatment AB test on test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:02:43] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:02:58] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:03:04] (03PS2) 10Phuedx: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) [12:03:22] thanks phuedx, sounds better :) [12:04:09] (03Merged) 10jenkins-bot: Revert "Revert "vector: Enable WVUI search on test wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664793 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:04:14] Urbanecm: fiwiki VPT has reports of T273317 resurfacing :/ [12:04:14] T273317: some users with access are unable to configure pending changes - https://phabricator.wikimedia.org/T273317 [12:04:25] Majavah: oh.... [12:04:42] will look after b&c [12:04:56] phuedx: pulled onto mwdebug1001, can you check? [12:04:59] and I can't stabilize on test2wiki either, https://test2.wikipedia.org/w/index.php?title=Special:Stabilization&page=16th_december [12:05:04] should I open that task or make a new one? [12:05:32] Majavah: i can stabilize there.. [12:05:37] ...but maybe that's because I'm a S? [12:05:40] Urbanecm: Tested. 
LGTM [12:05:44] thanks, syncing [12:06:40] (03CR) 10Phuedx: vector: Enable search treatment AB test on test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:06:58] Urbanecm: my test2wiki +sysop expired, that might explain it [12:07:06] (03CR) 10Urbanecm: [C: 03+2] vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:07:14] Majavah: ah [12:07:40] renewed [12:07:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7872251778b65cb03eb5457f1b901d208d514609: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 25s) [12:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:47] T259798: Deploy the new Vue.js search experience to the Beta-Cluster and Test Wikipedia - https://phabricator.wikimedia.org/T259798 [12:07:53] (03Merged) 10jenkins-bot: vector: Enable search treatment AB test on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664794 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:08:06] now I can stabilize [12:08:21] RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:08:34] phuedx: your second patch is at mwdebug1001 [12:08:42] Urbanecm: Thanks. 
Testing now [12:09:14] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:10:35] !log urbanecm@deploy1001 Synchronized dblists/desktop-improvements.dblist: 7872251778b65cb03eb5457f1b901d208d514609: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 09s) [12:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:48] Urbanecm: do you see the form on fiwiki? https://fi.wikipedia.org/w/index.php?title=Toiminnot:Vakauta_sivu&page=Coari for example [12:12:02] Majavah: I don't [12:12:12] but I'm also not a fiwiki editor [12:12:16] Urbanecm: LGTM. Thanks [12:12:18] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refresh conntrackd service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963) [12:12:22] thanks, syncing [12:14:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6eeee95e090408c8bd35d14c2f76e3afd8a59048: vector: Enable search treatment AB test on test wikis (T259798) (duration: 01m 08s) [12:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:08] T259798: Deploy the new Vue.js search experience to the Beta-Cluster and Test Wikipedia - https://phabricator.wikimedia.org/T259798 [12:14:11] and should be live :) [12:14:14] anything else phuedx ? [12:14:27] That's it for me. 
Thanks for deploying those changes Urbanecm [12:14:34] no problem :) [12:18:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh conntrackd service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/664800 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:20:11] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [12:21:21] (03PS1) 10Arturo Borrero Gonzalez: cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 [12:22:10] (03PS2) 10Awight: New 2FA device [puppet] - 10https://gerrit.wikimedia.org/r/662661 [12:26:05] (03PS3) 10Awight: New 2FA key for awight [puppet] - 10https://gerrit.wikimedia.org/r/662661 [12:26:29] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: ignore errors on ip token set [puppet] - 10https://gerrit.wikimedia.org/r/664806 [12:26:46] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [12:27:05] (03CR) 10Volans: [C: 04-1] "small typo, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez) [12:27:56] (03CR) 10Hnowlan: [C: 03+2] mtail: add exception handling in tests for non-Debian OSes [puppet] - 10https://gerrit.wikimedia.org/r/663860 (owner: 10Hnowlan) [12:29:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: ignore errors on ip token set [puppet] - 10https://gerrit.wikimedia.org/r/664806 (owner: 10Arturo Borrero Gonzalez) [12:38:01] (03PS1) 10Volans: pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 [12:38:03] (03PS1) 10Volans: fileio: add new module 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810 [12:38:06] (03PS1) 10Volans: fileio: manage blocks of text in files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664811 [12:39:32] (03CR) 10Volans: "I did make the code but because we might not be using it right now not sure if it's worth to add or not. I didn't write the test that woul" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664811 (owner: 10Volans) [12:40:06] (03PS1) 10Muehlenhoff: Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812 [12:40:08] (03PS1) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) [12:40:10] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 20%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14390 and previous config saved to /var/cache/conftool/dbconfig/20210217-124015-root.json [12:40:17] (03PS1) 10Jbond: make ca_source optional [puppet] - 10https://gerrit.wikimedia.org/r/664814 [12:40:19] (03PS1) 10Jbond: puppet_compiler: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/664815 [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:21] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:49] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [12:41:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1172 is now being automatically pooled into s8 [12:41:33] (03CR) 10Volans: [C: 03+1] "LGTM I can't recall by memory if it needs any additional tweak in other files" [puppet] - 10https://gerrit.wikimedia.org/r/664812 (owner: 10Muehlenhoff) [12:42:31] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [12:42:38] (03CR) 10Noa wmde: [C: 03+1] Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [12:42:41] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:42:52] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [12:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:23] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [12:43:38] (03CR) 10jerkins-bot: [V: 04-1] Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812 (owner: 10Muehlenhoff) [12:45:02] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[12:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:13] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [12:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:04] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1146 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:48:54] (03CR) 10Jbond: [C: 03+2] make ca_source optional [puppet] - 10https://gerrit.wikimedia.org/r/664814 (owner: 10Jbond) [12:49:00] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/664815 (owner: 10Jbond) [12:49:57] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [12:50:11] (03PS2) 10Muehlenhoff: Add unprivileged Cumin master(s) to network constants [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840) [12:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 40%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14391 and previous config saved to /var/cache/conftool/dbconfig/20210217-125519-root.json [12:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:46] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:00] (03PS2) 10JMeybohm: admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254) [13:05:59] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:10] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:02] (03PS3) 10Muehlenhoff: Add unprivileged Cumin master(s) to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840) [13:10:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14392 and previous config saved to /var/cache/conftool/dbconfig/20210217-131022-root.json [13:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:03] I’ll deploy a quick config change since the deployment calendar looks nicely free at the moment [13:16:07] (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) [13:16:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [13:17:21] (03Merged) 10jenkins-bot: Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [13:19:47] !log lucaswerkmeister-wmde@deploy1001 
Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:664593|Enable Wikibase Repo ID generator rate limiting on Wikidata (T272032)]] (duration: 01m 11s) [13:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:52] T272032: Add rate limit for creating Item IDs - https://phabricator.wikimedia.org/T272032 [13:21:43] (03CR) 10Bartosz Dziewoński: [C: 04-1] "This was done as part of Iba56724e62720dc2e3bdfd0837e1ced4cb337586" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) (owner: 10DannyS712) [13:25:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 60%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14393 and previous config saved to /var/cache/conftool/dbconfig/20210217-132526-root.json [13:25:27] (03PS1) 10JMeybohm: initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254) [13:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:54] 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:26:26] 10SRE, 10Traffic: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:28:32] !log installing libzstd security updates on Buster [13:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:58] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) [13:30:21] !log 
akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:32] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:50] (03CR) 10Bartosz Dziewoński: "Please check if this looks right, these configs are wonky." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński) [13:31:32] (03PS1) 10Kormat: integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821 [13:31:47] (03PS1) 10Muehlenhoff: Add library hint for libzstd [puppet] - 10https://gerrit.wikimedia.org/r/664822 [13:35:16] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libzstd [puppet] - 10https://gerrit.wikimedia.org/r/664822 (owner: 10Muehlenhoff) [13:36:48] (03CR) 10Kormat: [C: 03+2] integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821 (owner: 10Kormat) [13:38:11] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/664824 [13:39:16] (03Merged) 10jenkins-bot: integration: Move common funcs to integration/utils.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664821 (owner: 10Kormat) [13:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 80%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14395 and previous config saved to /var/cache/conftool/dbconfig/20210217-134030-root.json [13:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:29] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:37] 10SRE, 10Traffic: validate or revert the new 
large_objects_cutoff & nule_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 (10CDanis) [13:41:55] (03PS2) 10CDanis: Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) [13:42:14] (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/28108/" [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis) [13:42:54] 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) Current status: - `integration-env` script created to build docker image, download & cache bin... [13:43:44] 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) What's not integration-tested yet: - db-compare - db-stop-in-sync - db-switchover [13:44:27] (03CR) 10Muehlenhoff: [C: 03+2] Add unprivileged Cumin master(s) to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/664812 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [13:46:57] (03PS2) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) [13:51:22] (03Abandoned) 10DannyS712: Activate DiscussionTools as a beta feature on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) (owner: 10DannyS712) [13:55:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14396 and previous config saved to /var/cache/conftool/dbconfig/20210217-135533-root.json 
[13:55:34] (03PS1) 10Alexandros Kosiaris: default kubernetes policies: Add staging-codfw, remove logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 [13:55:36] (03PS1) 10Kormat: README.md: Update reqs for integration testing. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829 [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:38] (03PS1) 10Alexandros Kosiaris: calico: Remove default-deny GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 [13:55:40] (03PS1) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 [13:56:55] 10SRE, 10Traffic, 10Patch-For-Review: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 (10CDanis) [14:06:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) >>! In T258413#6836429, @MoritzMuehlenhoff wrote: > @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you... [14:07:17] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:24] (03PS3) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) [14:07:44] (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [14:08:53] (03CR) 10JMeybohm: [C: 04-1] "Why is that any better?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris) [14:08:57] (03CR) 10jerkins-bot: [V: 04-1] profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [14:09:21] (03CR) 10JMeybohm: [C: 03+2] initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:11:17] (03Merged) 10jenkins-bot: initialize_cluster: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664818 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:11:20] (03CR) 10Vgutierrez: [C: 03+1] Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis) [14:12:25] (03CR) 10CDanis: [C: 03+2] Increase nuke_limit in upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/664824 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis) [14:16:24] (03CR) 10Alexandros Kosiaris: "> Patch Set 1: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris) [14:16:56] (03PS4) 10Muehlenhoff: profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) [14:18:08] (03PS1) 10CDanis: Revert "Increase nuke_limit in upload@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/664657 (https://phabricator.wikimedia.org/T275028) [14:18:59] (03CR) 10CDanis: [C: 03+2] Revert "Increase nuke_limit in upload@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/664657 (https://phabricator.wikimedia.org/T275028) (owner: 10CDanis) [14:19:01] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:42] (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [14:26:25] !log starting rolling restart of cp-upload@eqsin varnish-fe T275028 [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:30] T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 [14:27:08] (03PS1) 10Filippo Giunchedi: pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830 [14:29:44] (03CR) 10MSantos: [C: 03+1] tegola: remove image in favour of blubber-built image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [14:30:29] (03CR) 10Muehlenhoff: [C: 03+2] profile::base::cuminunpriv: Allow SSH access from unprivileged Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/664813 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [14:36:11] (03CR) 10JMeybohm: [C: 03+1] "Sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [14:40:50] (03PS1) 10Kormat: integration: Allow skipping of checksumming of cached files. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831 [14:40:55] (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [14:42:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28111/console" [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [14:45:16] (03CR) 10Kormat: [C: 03+2] integration: Allow skipping of checksumming of cached files. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831 (owner: 10Kormat) [14:46:36] (03PS5) 10Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664788 [14:49:23] (03Merged) 10jenkins-bot: integration: Allow skipping of checksumming of cached files. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664831 (owner: 10Kormat) [14:50:58] (03CR) 10Elukey: "Thanks a lot for following up! I am going to get another -2 probably but hopefully this time the code is better :)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey) [14:53:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] P:services_proxy::envoy: add keepalive to restbase-https [puppet] - 10https://gerrit.wikimedia.org/r/664791 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [14:58:04] (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [14:58:41] 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10Ottomata) Approved. 
[15:00:27] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris) [15:03:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Ottomata) Hi @CBogen, do you need direct access to data in Hadoop and Hive, or will you just be using Superset to access that data via Presto / Druid? We've since... [15:03:10] 10SRE, 10SRE-Access-Requests: Deployment access for Gabriele Modena - https://phabricator.wikimedia.org/T275020 (10WDoranWMF) As @gmodena's manager I approve this access. [15:06:48] (03CR) 10DannyS712: [C: 03+1] pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 (owner: 10Volans) [15:08:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [15:09:05] (03CR) 10JMeybohm: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris) [15:10:43] (03Merged) 10jenkins-bot: Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [15:12:07] (03CR) 10Kormat: [C: 03+1] "This... actually looks good to me. :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/664788 (owner: 10Elukey) [15:12:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) >>! In T258413#6837821, @Ottomata wrote: > Hi @CBogen, do you need direct access to data in Hadoop and Hive, or will you just be using Superset to access t... 
[15:13:04] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet [15:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:00] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet [15:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:43] (03CR) 10Kormat: [C: 03+1] pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830 (owner: 10Filippo Giunchedi) [15:20:04] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet [15:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:27] RECOVERY - Check no envoy runtime configuration is left persistent on mw1331 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:24:32] (03CR) 10JMeybohm: [C: 03+1] Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [15:26:07] (03PS2) 10Arturo Borrero Gonzalez: cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 [15:26:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet [15:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:44] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] (03CR) 10MSantos: [C: 04-1] Add simple blubber image (031 comment) [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [15:31:34] !log uploaded jasper 
1.900.1-debian1-2.4+deb8u6+wmf3 to apt.wikimedia.org [15:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:42] (03CR) 10MSantos: [C: 04-1] "Another question I have, isn't this repository supposed to be an upstream mirror? Adding custom functionality, like the CI pipeline, could" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [15:32:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris) [15:32:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris) [15:33:37] (03Merged) 10jenkins-bot: default kubernetes policies: Add staging-codfw, remove logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/664826 (owner: 10Alexandros Kosiaris) [15:34:13] (03Merged) 10jenkins-bot: calico: Remove default-deny GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664827 (owner: 10Alexandros Kosiaris) [15:34:57] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet [15:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:10] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [15:35:20] (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [15:35:28] (03PS2) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 [15:36:54] !log T275028 rolling restart done; check for fetch failures once caches re-fill [15:36:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:58] T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 [15:39:26] (03CR) 10David Caro: [C: 03+1] cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez) [15:42:15] (03PS1) 10Muehlenhoff: Add nikkin and gmodena to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021) [15:42:47] (03CR) 10Jgiannelos: Add simple blubber image (031 comment) [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [15:43:55] (03CR) 10Muehlenhoff: [C: 03+2] Add nikkin and gmodena to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021) (owner: 10Muehlenhoff) [15:44:01] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) Hi @MoritzMuehlenhoff . Thanks for your help on this, I have managed to set up the config and ssh in. One question; I am not able to connect... [15:44:53] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) 05Resolved→03Open [15:45:05] 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) [15:45:27] (03CR) 10ArielGlenn: "not this way, please add them to platform-engineering as they are members of that team, that's what I tried to say in the task." 
[puppet] - 10https://gerrit.wikimedia.org/r/664843 (https://phabricator.wikimedia.org/T275021) (owner: 10Muehlenhoff) [15:45:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Deployment access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T275021 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @nikkin: You have been added to the deployment group. [15:45:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Deployment access for Gabriele Modena - https://phabricator.wikimedia.org/T275020 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @gmodena : You have been added to the deployment group. [15:45:57] 10SRE, 10GitLab, 10vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) >>! In T274459#6834660, @thcipriani wrote: > Hi @Dzahn apologies if there's a format for these kinds of requests that I missed: am I missing any info or tags for this request? Ah ha! Fou... [15:47:38] (03PS1) 10Giuseppe Lavagetto: Release 3.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/664844 [15:51:11] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron l3: activate net.netfilter.nf_conntrack_tcp_be_liberal [puppet] - 10https://gerrit.wikimedia.org/r/664845 (https://phabricator.wikimedia.org/T268335) [15:51:32] (03PS1) 10Muehlenhoff: Move nikkin and gmodena to platform-engineering instead (which also grants deployment) [puppet] - 10https://gerrit.wikimedia.org/r/664846 [15:52:06] (03CR) 10jerkins-bot: [V: 04-1] Move nikkin and gmodena to platform-engineering instead (which also grants deployment) [puppet] - 10https://gerrit.wikimedia.org/r/664846 (owner: 10Muehlenhoff) [15:53:27] (03CR) 10JMeybohm: [C: 04-1] calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [15:54:50] (03PS3) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/664828 [15:54:57] (03PS2) 10Muehlenhoff: Move nikkin and gmodena to platform-engineering [puppet] - 10https://gerrit.wikimedia.org/r/664846 [15:55:23] (03CR) 10Alexandros Kosiaris: calico: Specify a GlobalNetworkPolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [15:55:55] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission analytics10[42-57] - https://phabricator.wikimedia.org/T267932 (10wiki_willy) [15:55:58] (03CR) 10JMeybohm: [C: 03+1] calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [15:56:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28113/" [puppet] - 10https://gerrit.wikimedia.org/r/664845 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [15:58:15] (03CR) 10Muehlenhoff: [C: 03+2] Move nikkin and gmodena to platform-engineering [puppet] - 10https://gerrit.wikimedia.org/r/664846 (owner: 10Muehlenhoff) [15:58:38] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10elukey) Hi @ChristineDeKock, can you try to use `christinedk` as username and then the password of the wikitech credentials? [16:02:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [16:04:12] (03Merged) 10jenkins-bot: calico: Specify a GlobalNetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664828 (owner: 10Alexandros Kosiaris) [16:04:28] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release 3.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/664844 (owner: 10Giuseppe Lavagetto) [16:04:35] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) @elukey I have tried this and it fails. [16:05:04] (03CR) 10Volans: [C: 03+1] "Looks ok to me, hard to tell if it will work at first try given the amount of things moved around. Thanks a lot Luca for the effort!" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey) [16:05:50] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided) [16:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:18] (03CR) 10Muehlenhoff: [C: 03+1] cumin: aliases: introduce alias for cloudgw-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez) [16:06:21] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided) (duration: 00m 30s) [16:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:31] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:30] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Ottomata) Hiya, Christine will need LDAP membership in the `nda` group for this access. 
@ChristineDeKock FYI, we are slowly working towards using Conda based... [16:18:31] 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) @Ottomata just one more question for you! >>! In T263496#6744142, @CDanis wrote: >>>! In T263496#6744057, @Ottomata wrote: >> The long term solution here is still not cl... [16:20:47] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:20:56] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1035.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... [16:22:24] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 16 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:22:30] (03CR) 10Esanders: [C: 03+1] Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński) [16:23:10] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:18] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10MoritzMuehlenhoff) @ChristineDeKock : I've added you to the cn=nda LDAP group, can you please retry? 
[16:25:29] (03PS1) 10Urbanecm: Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) [16:32:54] !log installing intel-microcode security updates on buster [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:20] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [16:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:39] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: add ulogd ecs filter + tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [16:40:23] (03CR) 10Brennen Bearnes: [C: 03+1] "Nice - looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/664602 (owner: 10Ahmon Dancy) [16:40:59] (03CR) 10Dzahn: [C: 03+2] wikilabels: Remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/664752 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:41:33] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:22] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [16:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:49] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[16:44:05] (03PS1) 10Alexandros Kosiaris: calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851 [16:44:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [16:45:37] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [16:46:20] (03PS1) 10JMeybohm: envoy: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) [16:46:21] !log roll-restart logstash to apply ulogd filter - T234565 [16:46:24] (03PS1) 10JMeybohm: envoy-future: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664854 (https://phabricator.wikimedia.org/T274254) [16:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:27] T234565: Standardize the logging format - https://phabricator.wikimedia.org/T234565 [16:46:28] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[16:47:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851 (owner: 10Alexandros Kosiaris) [16:47:53] (03PS1) 10RobH: fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666) [16:48:18] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10CDanis) [16:49:05] (03Merged) 10jenkins-bot: calico: namespaceSelector for allow-all-icmp Global policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/664851 (owner: 10Alexandros Kosiaris) [16:49:52] 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10MattCleinman) Thanks! @Tnegrin is currently my manager. Will make sure he approves this ticket. [16:50:52] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) Thank you, it works! I now have access with username christinedk + my wikitech password, using the Newpyter instructions. [16:51:01] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [16:52:08] (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) [16:52:11] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [16:52:54] 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10Tnegrin) approved [16:53:58] (03CR) 10JMeybohm: "I tested this locally by now and it seems to work fine." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [16:54:04] (03CR) 10RLazarus: [C: 03+2] Add "minimum hits" support to logspam/logspam-watch [puppet] - 10https://gerrit.wikimedia.org/r/664602 (owner: 10Ahmon Dancy) [16:54:52] (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) [16:55:23] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [16:56:58] (03CR) 10Volans: "Replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [16:56:58] the logstash alerts is me, should recover shortly [16:57:22] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:57:22] or not [16:57:22] (03CR) 10RLazarus: [C: 03+1] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - 10https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [16:57:32] PROBLEM - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 #page on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:57:40] sorry that's going to page I think [16:57:43] yes :( [16:57:52] Hi [16:57:57] my bad [16:57:58] indeed [16:58:09] whew i was like [16:58:10] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [16:58:13] I'll revert [16:58:13] * volans ignoring as stated above [16:58:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:16] 'im installing new logstash, what did i do wrong' [16:58:35] godog: I'll prep a patch [16:58:36] (03PS1) 10Filippo Giunchedi: Revert "logstash: add ulogd ecs filter + tests" [puppet] - 10https://gerrit.wikimedia.org/r/664661 [16:58:50] (03PS2) 10RobH: fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666) [16:58:51] shdubsh: for the revert or the fix ? [16:58:58] godog: fix [16:59:04] * robh isnt merging that until outage is over no worries [16:59:12] i didnt mean to click rebase =P [16:59:17] am about to get on a call, but please ping me if you need an extra set of hands [16:59:19] robh: go ahead if you wish, that's fine [16:59:31] oh, i just didnt want to add to noise, my change is unrelated [16:59:35] apologies! [16:59:44] shdubsh: ack, thank you [17:00:23] (03PS1) 10Dzahn: mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - 10https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757) [17:00:38] (03CR) 10RobH: [C: 03+2] fixing logstash103[345] partman [puppet] - 10https://gerrit.wikimedia.org/r/664855 (https://phabricator.wikimedia.org/T267666) (owner: 10RobH) [17:00:46] (03PS2) 10Elukey: sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 [17:00:49] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.4167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:01:09] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [17:02:45] 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 
(10Ottomata) Let's do the former, I think doing the latter (using schemas to configure included data) is going to be the right solution after all. So, special case these headers ju... [17:02:46] (03CR) 10Elukey: sre.hosts.decommission: move to class API (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey) [17:03:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE [17:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:37] (03PS1) 10Bstorm: wikireplicas: fix the centralauth management bit of the view scripts [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) [17:04:12] (03PS1) 10Cwhite: profile: disable ulogd_ecs filter on legacy logstash [puppet] - 10https://gerrit.wikimedia.org/r/664861 [17:04:50] (03CR) 10Bstorm: "As we wind down the old replicas, we are going to want to condense some of the filtering steps in this." [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [17:05:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE [17:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [17:06:09] (03CR) 10Bstorm: "This is already tested via livehack, so I'll merge as soon as the jenkins job finishes." 
[puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [17:06:53] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28114/console" [puppet] - 10https://gerrit.wikimedia.org/r/664861 (owner: 10Cwhite) [17:07:11] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:07:16] (03CR) 10Cwhite: [C: 03+2] profile: disable ulogd_ecs filter on legacy logstash [puppet] - 10https://gerrit.wikimedia.org/r/664861 (owner: 10Cwhite) [17:07:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1035.eqiad.wmnet'] ` Of which those **FAILED**: ` ['logstash1035.eqiad.wmnet'] ` [17:07:44] shdubsh: thank you, I'll merge etc [17:07:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:58] o11y folks: it looks like the VO incident didn't auto-resolve again, do y'all have a phab task tracking that? 
I see T264016 T266570 T263423 for individual cases, but not sure if there's anything for the overall issue [17:07:59] T266570: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 [17:07:59] T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016 [17:07:59] oh nevermind, I see you did that already [17:07:59] T263423: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423 [17:08:01] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [17:08:16] shdubsh: are you running puppet as well ? [17:08:18] heh, sorry. was quick on the button [17:08:27] yeah [17:09:11] rzl: there isn't an overall tracking task afaik no, which incident # tho ? [17:09:28] shdubsh: thanks! [17:09:47] godog: https://portal.victorops.com/ui/wikimedia/incident/809/details from just now [17:10:17] rzl: the alert didn't recover yet [17:10:18] oh! wait sorry, I saw a recovery and thought it was the page [17:10:19] never mind :) [17:10:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE [17:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:22] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1035.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... [17:11:31] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [17:12:24] (03CR) 10Andrew Bogott: "ok, so catching up... 
given that everything is already sent to syslog for all DB servers (Galera and otherwise), this change will definite" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [17:12:28] (03Abandoned) 10Andrew Bogott: Openstack control node galera: send mariadb logs to central logging [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [17:12:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE [17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE [17:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:07] (03CR) 10Hnowlan: start using imposm as OSM sync tool (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:13:34] (03PS35) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:13:43] (03CR) 10Elukey: [C: 03+2] sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey) [17:13:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE [17:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:22] (03CR) 10David Caro: utils: add script to run docker ci tests locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [17:14:38] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc 
exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:14:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE [17:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:23] (03PS1) 10Cwhite: profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 [17:15:38] (03CR) 10jerkins-bot: [V: 04-1] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite) [17:15:58] (03PS2) 10Cwhite: profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 [17:16:24] (03CR) 10Bstorm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [17:16:34] 10SRE, 10SRE-Access-Requests: Deployment access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T275021 (10nnikkhoui) Thank you @MoritzMuehlenhoff !
And thank you @ArielGlenn for requesting :) [17:16:39] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:16:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE [17:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:09] (03Merged) 10jenkins-bot: sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey) [17:17:39] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite) [17:18:20] (03CR) 10Cwhite: [C: 03+2] profile: add content to filter_ulogd_ecs to appease config validator [puppet] - 10https://gerrit.wikimedia.org/r/664862 (owner: 10Cwhite) [17:20:17] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix the centralauth management bit of the view scripts [puppet] - 10https://gerrit.wikimedia.org/r/664860 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [17:21:48] (03CR) 10Hnowlan: "> Patch Set 1:" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [17:22:11] (03PS2) 10Hnowlan: Add simple blubber image [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 [17:22:11] ACKNOWLEDGEMENT - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB daniel_zahn need reboot but test ongoing? https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:22:11] ACKNOWLEDGEMENT - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB daniel_zahn need reboot but test ongoing? 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:22:23] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [17:25:06] (03PS1) 10Cwhite: profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 [17:25:28] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [17:25:39] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.825 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:26:31] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE [17:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:52] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite) [17:27:26] (03CR) 10Cwhite: [C: 03+2] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite) [17:27:32] (03PS2) 10Cwhite: profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 [17:27:36] (03CR) 10Cwhite: [V: 03+2 C: 03+2] profile: set content directy bypassing config validator check [puppet] - 10https://gerrit.wikimedia.org/r/664863 (owner: 10Cwhite) [17:27:40] (03CR) 10Hnowlan: [C: 03+1] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [17:28:31] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE [17:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:36] (03PS1) 10JMeybohm: prometheus-statsd-exporter: Run as nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664864 (https://phabricator.wikimedia.org/T274254) [17:28:38] (03PS1) 10JMeybohm: nutcracker: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664865 (https://phabricator.wikimedia.org/T274254) [17:29:24] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:29:45] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [17:29:48] RECOVERY - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 #page on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 11514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:29:49] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [17:29:59] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [17:29:59] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [17:30:23] VO auto-resolved after all \o/ sorry for doubting [17:31:03] heheh that's fair [17:31:37] PROBLEM - Logstash Elasticsearch indexing 
errors #o11y on alert1001 is CRITICAL: 8.887 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:31:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:34:15] (03PS2) 10Elukey: sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) [17:35:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1035.eqiad.wmnet'] ` and were **ALL** successful. [17:36:33] !log roll-restart logstash7 in codfw/eqiad to apply ulogd filters - T234565 [17:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:38] T234565: Standardize the logging format - https://phabricator.wikimedia.org/T234565 [17:36:51] (03CR) 10Elukey: [C: 03+2] sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:38:08] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [17:38:46] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:39:13] (03Merged) 10jenkins-bot: sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:39:18] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) [17:48:23] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use 'stein' post openstack eqiad1 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/664830 (owner: 10Filippo Giunchedi) [17:53:28] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] tegola: remove image in favour of blubber-built image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [17:54:11] (03CR) 10Hnowlan: [C: 03+2] role::maps: fix MOTD message [puppet] - 10https://gerrit.wikimedia.org/r/662659 (owner: 10Hnowlan) [17:58:55] 10SRE, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Ottomata) (Sorry, just edited ^, somehow a very important 'not' did not make it through my typing fingers) [18:00:15] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) Thanks @MoritzMuehlenhoff! I have managed to set up the config and ssh in but I am not able to connect to JupyterLab (https://wikitech.wikimedia.org/wiki/... 
[18:00:30] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:01:32] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:05:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Ottomata) Hi Pablo! As a WMF employee you should be added to the `wmf` LDAP group. This should allow you to connect. [18:06:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810 (owner: 10Volans) [18:07:08] (03CR) 10Effie Mouzeli: [C: 03+1] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn) [18:07:46] !log disable puppet on mw* in eqiad [18:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:54] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1275.eqiad.wmnet'] ` and were **ALL** s... [18:08:01] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10bd808) This bug and {T274090} are both semi-innocuous in that neither is currently causing IABot or the wikis to break, bu... 
[18:08:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1275.eqiad.wmnet [18:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:31] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1344.eqiad.wmnet'] ` and were **ALL** s... [18:08:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1344.eqiad.wmnet [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:09] (03PS1) 10Giuseppe Lavagetto: Add php 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664884 [18:09:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1343.eqiad.wmnet'] ` and were **ALL** s... 
[18:11:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1343.eqiad.wmnet [18:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:18] !log mw1350 - powercycled via mgmt [18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:56] (03CR) 10Dzahn: [C: 03+2] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn) [18:17:12] (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) [18:17:46] (03CR) 10Dzahn: [V: 03+2 C: 03+2] mcrouter: move mcrouter proxy for codfw D3 to mw2273 [puppet] - 10https://gerrit.wikimedia.org/r/664857 (https://phabricator.wikimedia.org/T2457571) (owner: 10Dzahn) [18:18:29] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1350.eqiad.wmnet'] ` and were **ALL** s... 
[18:18:54] (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - 10https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [18:19:29] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1350.eqiad.wmnet [18:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:08] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [18:20:17] !log ppchelko@deploy1001 Started deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855 [18:20:17] (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - 10https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [18:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 [18:21:06] (03CR) 10Elukey: [C: 03+1] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - 10https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [18:21:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1275.eqiad.wmnet [18:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1343.eqiad.wmnet [18:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1344.eqiad.wmnet [18:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:41] !log enable puppet on mw* [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:46] !log dzahn@cumin1001 conftool action : 
set/pooled=yes; selector: name=mw1350.eqiad.wmnet [18:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:24] (03CR) 10Herron: [C: 03+2] kibana: set vega.enabled: false by default [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: 10Herron) [18:34:29] (03CR) 10Hashar: [C: 04-1] "I have learned there is an Apereo CAS test instance on WMCS reachable at idp.wmcloud.org . Some infos at T274461#6835716" [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox) [18:38:51] (03CR) 10Bstorm: "Does that address your concerns Alex?" [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [18:40:10] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855 (duration: 19m 54s) [18:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:16] T242855: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 [18:47:27] Pchelolo: \o/ !!! Thanks! [18:47:44] akosiaris: sorry it took so long, I forgot about it [18:47:58] I believe now your puppet patch is safe [18:48:36] Pchelolo: I 've waited so long for undeploying the service that I did not even notice ;) [18:48:50] ok will proceed with the undeploy tomorrow EU morning then [18:54:05] congrats on killing graphoid [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1900). [19:00:04] DannyS712 and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:04] marxarelli and twentyafterfour: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T1900). [19:00:12] here [19:00:24] i can deploy today [19:00:32] i gave just no-op cleanup patches [19:00:39] have* [19:01:29] ack [19:02:08] DannyS712: your patch is WIP [19:03:14] (03CR) 10Urbanecm: [C: 03+2] Remove uses of removed VisualEditor config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński) [19:03:17] (03CR) 10Urbanecm: [C: 03+2] Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński) [19:03:19] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712) [19:03:22] Urbanecm fixed [19:03:28] thx [19:04:12] MatmaRex: do you want me to pull them onto a mwdebug host, once they merge? 
[19:04:13] (Merged) jenkins-bot: Remove uses of removed VisualEditor config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) (owner: Bartosz Dziewoński)
[19:04:35] (PS2) Urbanecm: Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: Bartosz Dziewoński)
[19:04:39] Urbanecm: there's nothing to test, unless i made a typo or something
[19:04:39] (CR) Urbanecm: [C: +2] Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: Bartosz Dziewoński)
[19:04:45] MatmaRex: ack
[19:05:34] (Merged) jenkins-bot: Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: Bartosz Dziewoński)
[19:05:41] syncing CS.php
[19:05:57] (PS3) Urbanecm: Enable GlobalWatchlist extension on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[19:06:02] (CR) Urbanecm: [C: +2] Enable GlobalWatchlist extension on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[19:06:51] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 6ac78bd2aa601db537f821c89b447c04927af422: Remove uses of removed VisualEditor config variables (T273177; 1/2) (duration: 01m 14s)
[19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:56] T273177: Remove unused config options VisualEditorNewAccountEnableProportion and VisualEditorAutoAccountEnable - https://phabricator.wikimedia.org/T273177
[19:07:21] and syncing IS.php
[19:07:25] (Merged) jenkins-bot:
Enable GlobalWatchlist extension on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/663068 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[19:07:28] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:08:06] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:08:21] rzl: ^ I did not fix it but I am glad it is
[19:08:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6ac78bd2aa601db537f821c89b447c04927af422: Remove uses of removed VisualEditor config variables (T273177; 2/2) (duration: 01m 07s)
[19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:37] MatmaRex: deployed
[19:08:45] thanks
[19:08:52] mutante: it may be related to me doing deployments (which clears opcache in some cases iirc)
[19:09:20] SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (nettrom_WMF) >>! In T258413#6837869, @CBogen wrote: >>>! In T258413#6837821, @Ottomata wrote: >> Hi @CBogen, do you need direct access to data in Hadoop and Hive,...
[19:09:22] rzl: issues were found with mcrouter connecting to mcrouter proxies.. if they are mixed stretch-buster, but not for stretch-stretch or buster-buster, so the fix for that is upgrading all to buster
[19:09:23] DannyS712: your patch is available at mwdebug1001, please test
[19:09:40] Urbanecm: I think I was missing an extra reboot after reinstall maybe..
that would have cleared it
[19:09:54] but then did not want to do that because others were working
[19:10:03] now it's just resolved anyways
[19:10:08] i see :)
[19:10:26] and thanks for the explanation in the list mutante
[19:10:35] as long as scap updates the code there, I'm good :)
[19:11:32] Urbanecm the special page loads, but "Skipped unresolvable module ext.globalwatchlist.specialglobalwatchlist" so it doesn't work. I'm guessing that's because the load.php request is going to a different server than mwdebug1001 - it works fine on testwiki
[19:11:54] the settings page loads, but again without the styles module
[19:12:01] i would blame RL cache
[19:12:31] let's sync it
[19:12:33] why would load.php not go to mwdebug? it's from your browser after all
[19:12:41] Urbanecm: ACK, yes, it should get scap updates:) in a couple weeks (?) I will ask you if we still need mwdebug1003
[19:12:51] sure :)
[19:13:29] (PS1) AndyRussG: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054)
[19:14:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 352dd72c28462755546ac36a017548a7f0925df0: Enable GlobalWatchlist extension on metawiki (T260862) (duration: 01m 07s)
[19:14:22] DannyS712: and...done!
[19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:25] T260862: Deploy GlobalWatchlist extension to production (Meta only) - https://phabricator.wikimedia.org/T260862
[19:14:40] anything else?
[19:14:40] it works!
[19:14:44] cool
[19:14:46] thanks so much
[19:14:48] woo!
[19:16:49] (PS2) AndyRussG: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054)
[19:17:25] (CR) AndyRussG: [C: -1] Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[19:17:28] (PS1) Urbanecm: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977)
[19:17:40] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1033.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[19:17:44] (PS2) Urbanecm: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977)
[19:17:48] (CR) Urbanecm: [C: +2] tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977) (owner: Urbanecm)
[19:18:48] oh, global watchlist
[19:19:19] (PS1) Urbanecm: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976)
[19:19:21] hi tabbycat
[19:19:30] meow Urbanecm
[19:19:58] (Merged) jenkins-bot: tlwikibooks: Add WB as an alias to NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/664890 (https://phabricator.wikimedia.org/T274977) (owner: Urbanecm)
[19:20:16] RECOVERY - dhclient process on sretest1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:20:33] (PS2)
Urbanecm: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976)
[19:20:37] (CR) Urbanecm: [C: +2] tlwikibooks: Add Wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976) (owner: Urbanecm)
[19:21:40] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a7eb726f01ab5332d8b8951fdd0fa0c5a9459d4c: tlwikibooks: Add WB as an alias to NS_PROJECT (T274977) (duration: 01m 09s)
[19:21:42] (Merged) jenkins-bot: tlwikibooks: Add Wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/664891 (https://phabricator.wikimedia.org/T274976) (owner: Urbanecm)
[19:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:48] T274977: Addition of the WB: namespace alias in Tagalog Wikibooks - https://phabricator.wikimedia.org/T274977
[19:22:22] mutante: ahh nod
[19:24:20] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=tlwikibooks --fix # T274977 # P14403
[19:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:58] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Dzahn) Do these machines just have to talk to each other (on what port/protocol btw?) or does it _really_ require that they are in wikimedia.org directly exposed to the Internet and without any cachi...
[19:26:27] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c37fa0115113fb31cb54d9cf3f18a13f656c73dd: tlwikibooks: Add Wikijunior namespace (T274976) (duration: 01m 09s)
[19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:31] T274976: Addition of the Wikijunior: namespace in Tagalog Wikibooks - https://phabricator.wikimedia.org/T274976
[19:26:54] DannyS712: wheee congrats!!
[19:27:21] 11:11:32 Urbanecm the special page loads, but "Skipped unresolvable module ext.globalwatchlist.specialglobalwatchlist" so it doesn't work. I'm guessing that's because the load.php request is going to a different server than mwdebug1001 - it works fine on testwiki <-- shouldn't be possible, X-Wikimedia-Debug will route all requests, including load.php, to the mwdebug server
[19:27:36] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=tlwikibooks --fix # T274976 # P14404
[19:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:49] indeed, it worked for me when i tried it myself shortly before syncing
[19:27:53] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Dzahn) Machines that are directly exposed to the Internet and are managed manually are more of a challenge to security practices than internal machines I would think.
[19:27:54] (incl. style module)
[19:28:56] i think RL just needed some time to notice the new module
[19:32:22] (PS1) Urbanecm: arbcom_ruwiki: Add arbcom user group [mediawiki-config] - https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844)
[19:32:29] (CR) Urbanecm: [C: +2] arbcom_ruwiki: Add arbcom user group [mediawiki-config] - https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844) (owner: Urbanecm)
[19:32:43] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1033.eqiad.wmnet with reason: REIMAGE
[19:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:38] (Merged) jenkins-bot: arbcom_ruwiki: Add arbcom user group [mediawiki-config] - https://gerrit.wikimedia.org/r/664892 (https://phabricator.wikimedia.org/T274844) (owner: Urbanecm)
[19:33:47] * legoktm nods
[19:34:48] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1033.eqiad.wmnet with
reason: REIMAGE
[19:34:50] &debug=1 probably would've solved it
[19:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:09] (PS1) Urbanecm: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796)
[19:36:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6c5c5f0d1b83a7f05272f133c269c740af8352db: arbcom_ruwiki: Add arbcom user group (T274844) (duration: 01m 12s)
[19:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:20] T274844: arbcom-ru.wikipedia.org: add rights to bureaucrats usergroup - https://phabricator.wikimedia.org/T274844
[19:36:40] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Dzahn) regarding the request for 24GB of RAM: This would make these the VMs with the most memory globally... more than _anything_ else. To give you an idea .. all existing ganeti VMs are between 1...
[19:37:48] (PS1) Urbanecm: hewikisource: Allow reviewers to rollback [mediawiki-config] - https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796)
[19:38:43] (PS2) Urbanecm: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796)
[19:38:49] (CR) Urbanecm: [C: +2] hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[19:39:56] (Merged) jenkins-bot: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import [mediawiki-config] - https://gerrit.wikimedia.org/r/664893 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[19:41:34] (PS2) Urbanecm: hewikisource: Allow reviewers to rollback [mediawiki-config] - https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796)
[19:41:38] (CR) Urbanecm: [C: +2] hewikisource: Allow reviewers to rollback [mediawiki-config] - https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[19:41:46] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1033.eqiad.wmnet'] ` and were **ALL** successful.
[19:42:27] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 88e6ebc5565a7a0b1431dd5f52c701d8df641990: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import (T274796) (duration: 01m 09s)
[19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:32] T274796: Several permission changes for he.wikisource - https://phabricator.wikimedia.org/T274796
[19:42:44] (Merged) jenkins-bot: hewikisource: Allow reviewers to rollback [mediawiki-config] - https://gerrit.wikimedia.org/r/664894 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[19:44:37] (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:44:42] (PS2) Dzahn: mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757)
[19:45:31] (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for codfw B3 to mw2258 [puppet] - https://gerrit.wikimedia.org/r/664856 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:45:46] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2e521f76c195ab50ab28a7d4812a35ceac246907: hewikisource: Allow reviewers to rollback (T274796) (duration: 01m 10s)
[19:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:07] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH)
[19:49:04] (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:49:11] * Urbanecm done
[19:49:46] (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] -
https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:49:52] (PS2) Dzahn: mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368 [puppet] - https://gerrit.wikimedia.org/r/664692 (https://phabricator.wikimedia.org/T245757)
[19:54:50] (CR) Gehel: [C: -1] "one last comment inline (well, a few, but only the one about default attribute is really important)" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[19:56:26] (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:57:39] (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[19:57:46] (PS3) Dzahn: mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271 [puppet] - https://gerrit.wikimedia.org/r/664690 (https://phabricator.wikimedia.org/T245757)
[19:58:15] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Urbanecm) Why does a testing service need to be in production? Stuff in production realm should have production-level stability, and not be used for testing. Can you use a cloud-provided VM instead?
[20:00:04] marxarelli and twentyafterfour: Dear deployers, time to do the Mediawiki train - American Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T2000).
[20:05:58] SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (Ottomata) Ok, so just `wmf` LDAP and `analytics-privatedata-users` posix membership is needed. Thank you.
[20:06:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (Abit) > In T258413#6836429, @MoritzMuehlenhoff wrote: > @CBogen : Hi, this needs approval from the following people. Once those are done on task, I'll add you to a...
[20:07:30] (PS1) Dzahn: mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288 [puppet] - https://gerrit.wikimedia.org/r/664898 (https://phabricator.wikimedia.org/T245757)
[20:12:13] (PS1) Dduvall: group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899
[20:12:15] (CR) Dduvall: [C: +2] group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899 (owner: Dduvall)
[20:13:07] (Merged) jenkins-bot: group1 wikis to 1.36.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/664899 (owner: Dduvall)
[20:14:26] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` logstash1034.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[20:15:53] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.31
[20:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:09] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.31 (duration: 01m 15s)
[20:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:18] (PS1) Dzahn: admin: create new group for gitlab-roots [puppet] - https://gerrit.wikimedia.org/r/664902 (https://phabricator.wikimedia.org/T274953)
[20:22:48] wmf.31 seems pretty quiet
[20:23:05] !log 1.36.0-wmf.31 rolled to group1.
no new errors for wmf.31 (T271345)
[20:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:11] T271345: 1.36.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T271345
[20:24:35] SRE, ops-codfw, DBA, DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (Papaul)
[20:27:15] (PS1) Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075)
[20:28:40] (PS1) Dzahn: create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458)
[20:29:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
[20:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:45] (CR) Dzahn: "The admin group this uses would be created in https://gerrit.wikimedia.org/r/c/operations/puppet/+/664902" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:30:11] (CR) Dzahn: "the placeholder role this needs even if nothing else is puppetized ..
would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/664904" [puppet] - https://gerrit.wikimedia.org/r/664902 (https://phabricator.wikimedia.org/T274953) (owner: Dzahn)
[20:30:15] (CR) jerkins-bot: [V: -1] create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:31:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
[20:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:54] (PS2) Dzahn: create placeholder role/profile for gitlab VMs [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458)
[20:38:18] (PS4) Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[20:39:26] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1034.eqiad.wmnet'] ` and were **ALL** successful.
[20:39:52] (CR) jerkins-bot: [V: -1] phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[20:40:32] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH)
[20:41:46] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Joe) >>! In T274459#6839105, @Urbanecm wrote: > Why does a testing service need to be in production? Stuff in production realm should have production-level stability, and not be used for testing. Can...
[20:45:21] (CR) Giuseppe Lavagetto: [C: -2] ""These VMs will not be puppetized" needs a thorough discussion." [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[20:46:56] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH) Open→Resolved Ok, these are all setup and imaged, staged and ready for subteam takeover.
[20:54:15] (CR) Wolfgang Kandek: "Puppetization is planned to happen once we have SREs hired and they take over from the contractors. We estimate 6 months before that can h" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[21:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T2100).
[21:06:38] (CR) Effie Mouzeli: [C: +1] mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[21:06:43] (CR) Dzahn: [C: +2] mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[21:06:50] (PS2) Dzahn: mcrouter: move mcrouter proxy for codfw C3 to mw2337 [puppet] - https://gerrit.wikimedia.org/r/664859 (https://phabricator.wikimedia.org/T245757)
[21:22:00] (PS1) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384)
[21:29:10] (PS2) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies.
[debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384)
[21:42:57] (PS1) Andrew Bogott: Horizon: ship error logs to ELK [puppet] - https://gerrit.wikimedia.org/r/664934 (https://phabricator.wikimedia.org/T268175)
[21:47:04] ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[21:47:54] (PS5) Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[21:47:57] ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH) a:Jclark-ctr
[21:48:17] ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[21:49:35] (CR) Andrew Bogott: [C: +2] Horizon: ship error logs to ELK [puppet] - https://gerrit.wikimedia.org/r/664934 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[22:02:36] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:04:20] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:08:36] (CR) Thcipriani: "> Patch Set 2: Code-Review-2" [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[22:11:32] (CR) Razzi: [C: +1] "Looks good, could do a bit of code cleanup" (2 comments) [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384) (owner: Ottomata)
[22:27:17] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2035 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:27:29] hi :(
[22:27:33] again
[22:27:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:27:59] here if needed but likely others better suited about.
[22:28:15] (PS1) Bartosz Dziewoński: Make DiscussionTools' replytool available for everyone on gomwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/664940 (https://phabricator.wikimedia.org/T258554)
[22:28:38] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[22:28:43] same pattern as yesterday except that we can certainly rule out the 304s as unrelated, if anyone was unsure
[22:29:10] (PS1) Urbanecm: hewikisource: Allow sysops to grant/revoke reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/664941 (https://phabricator.wikimedia.org/T274796)
[22:29:34] (CR) Ottomata: Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. (2 comments) [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384) (owner: Ottomata)
[22:30:22] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:30:25] that is something
[22:30:47] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6258 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[22:31:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:32:55] it's roughly the same time as yesterday too
[22:33:31] yeah, about an hour later
[22:33:56] (CR) Aklapper: "Thanks for the quick merge! Nah, no pasting of results needed."
[puppet] - https://gerrit.wikimedia.org/r/664002 (https://phabricator.wikimedia.org/T274711) (owner: Aklapper)
[22:34:11] (PS1) Andrew Bogott: lookup_table_output.json: Send horizon logs to kafka [puppet] - https://gerrit.wikimedia.org/r/664943 (https://phabricator.wikimedia.org/T268175)
[22:34:15] There's some replag alerts though today according to -databases that I don't remember going off yesterday but happened around same time as the page
[22:35:30] active worker count is still elevated, I wouldn't be surprised if this pages again, still digging
[22:36:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:36:30] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:36:31] (CR) Andrew Bogott: [C: +2] lookup_table_output.json: Send horizon logs to kafka [puppet] - https://gerrit.wikimedia.org/r/664943 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[22:36:52] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani) This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks can use. >>! In T274459#6839006, @Dzahn wrote: > regarding...
[22:38:08] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:38:15] SRE, GitLab, vm-requests: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani) [22:38:16] rzl: what are you looking through? [22:38:53] legoktm: atm just dashboards -- sshing to an api server now to poke through logs and see if anything stands out [22:39:16] legoktm: if this is happening always around the same time, I'd see if there's some maintenance.pp cron behind :) [22:39:25] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2604 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [22:40:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [22:41:14] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:41:59] tabbycat: I don't see anything active right now [22:42:03] just background parser cache purge [22:42:22] and extensions/MediaModeration/maintenance/ModerateExistingFiles.php [22:42:51] MediaModeration, that's new to me [22:42:59] legoktm: when did that start [22:43:16] when did what start?
[22:43:24] Because swift did come up yesterday iirc and swift = files in the most basic sense [22:43:30] legoktm: the moderate files script [22:44:42] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 29 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:45:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:45:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:46:11] legoktm: everything in the slowlog looks like it's hanging in DB calls, and s1 open connections are spiking, same as yesterday https://grafana.wikimedia.org/goto/BYO6_XPGz [22:49:37] can we look at the queries themselves? 
[22:49:51] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5424 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [22:50:03] a DBA would know how :P I'm trying to avoid having to wake one up, but will if I can't figure this out before too long [22:50:15] https://tendril.wikimedia.org/activity?research=0&labsusers=0 [22:50:32] ah cheers [22:51:23] is it okay if we just kill those queries? [22:51:36] they're stacking up too [22:51:58] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:52:31] legoktm: sounds good to me [22:52:54] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [22:53:19] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1208 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [22:53:24] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1308 gt 
100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:56:00] CSS is loading slowly [22:56:25] legoktm: rzl: do we know which queries cause this? We can try to disable the feature that generates them if needed [22:56:35] Bsadowski1: id guess expected [22:56:36] "wgBackendResponseTime":520" :O [22:56:38] uh, I'm not actually sure how to kill a query [22:56:42] Bsadowski1: known [22:56:47] ""wgHostname":"mw1370"}" [22:56:48] k [22:57:04] legoktm: mind me updating the status? Or should I wait? [22:57:12] please [22:57:19] Bsadowski1: you can see ongoing pages on https://klaxon.wikimedia.org/ under recent. [22:57:44] legoktm: does this work? [22:57:49] Urbanecm: lgtm, thanks [22:57:53] np [23:00:13] there used to be https://wikitech.wikimedia.org/wiki/Query_killer [23:00:23] that kills queries over 60 s? [23:00:27] legoktm: there are some docs about killing at https://wikitech.wikimedia.org/wiki/MariaDB#Long_running_queries [23:03:22] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [23:03:47] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2781 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [23:04:07] Urbanecm: ty figured it out [23:04:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is 
CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:04:19] great [23:08:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:09:06] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:09:31] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) [23:15:59] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7894 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [23:16:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:21:16] we're still root-causing this in another channel, but we believe everything is currently mitigated -- please yell if you're still experiencing slowness :) [23:25:12] rzl: seems ok. I'll shout if i hear anything. Thanks for the work. [23:25:30] is there a task i can read in the morning if you do find out the cause [23:25:39] * RhinosF1 likes to be nosey [23:25:47] SRE, InternetArchiveBot, Traffic, Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (bd808) >>! In T269914#6835578, @Legoktm wrote: > We saw a bunch of these requests again today. The main problem is that ma... [23:30:26] RhinosF1: it may not be public right away but we'll share as much as we can :) [23:31:12] SRE: sessionstore SSL cert CRIT in Icinga since > 6 days - https://phabricator.wikimedia.org/T275090 (Dzahn) [23:31:52] rzl: ack ty, if there's a task number i'll bookmark it