[00:00:34] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:06] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:20] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:22] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:22] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:34] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47596872 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:35:26] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 252056 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:40] (CR) Huji: [C: +1] "Great. This will have to wait until after July 2nd (which is when the wmf.39 will reach Wikipedias).
I will try to schedule it for July 6t" [mediawiki-config] - https://gerrit.wikimedia.org/r/608222 (https://phabricator.wikimedia.org/T255506) (owner: ProcrastinatingReader)
[04:31:32] (PS1) Marostegui: mariadb: Move db1135 to s1 [puppet] - https://gerrit.wikimedia.org/r/608256 (https://phabricator.wikimedia.org/T253217)
[04:37:21] (CR) Marostegui: [C: +2] mariadb: Move db1135 to s1 [puppet] - https://gerrit.wikimedia.org/r/608256 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[04:46:46] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) p:Triage→Medium @herron any idea how big these DBs can be and how many writes we'd be expecting? Which grants would be needed? I would assume we do need backups, r...
[04:53:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[04:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[04:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:08] !log Stop MySQL on db1080 to clone db1135 T253217
[04:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:12] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[05:43:57] (CR) Ammarpad: [C: +1] Set $wgForceUIMsgAsContentMsg for Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. [mediawiki-config] - https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) (owner: Hamish)
[06:04:28] Operations, CX-cxserver, Citoid, Core Platform Team, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (Physikerwelt) I do not really understand what needs to be done within mathoid.
Mathoid has two [dependencies](https://github.com/wikimedi...
[06:29:26] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:40] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:10] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:28] PROBLEM - ores uWSGI web app on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:33:06] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:35:38] !log force puppet run on ores* to overcome celery OOMs on some nodes
[06:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:42] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:36:56] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:24] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:56] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:41:08] (PS1) Elukey: Decommission an-launcher1001 [puppet] -
https://gerrit.wikimedia.org/r/608258 (https://phabricator.wikimedia.org/T256363)
[06:44:39] (CR) Elukey: [C: +2] Decommission an-launcher1001 [puppet] - https://gerrit.wikimedia.org/r/608258 (https://phabricator.wikimedia.org/T256363) (owner: Elukey)
[06:45:04] !log Deploy MCR schema change on db1090:3312
[06:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:45] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[06:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:15] !log execute gnt-instance remove an-launcher1001.eqiad.wmnet on ganeti1011 - T256363
[06:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:19] T256363: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363
[06:50:41] (PS1) Marostegui: instances.yaml: Remove db1080, add db1135 [puppet] - https://gerrit.wikimedia.org/r/608259 (https://phabricator.wikimedia.org/T253217)
[06:51:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:51:57] (CR) Marostegui: [C: +2] instances.yaml: Remove db1080, add db1135 [puppet] - https://gerrit.wikimedia.org/r/608259 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[06:52:50] does the decom cookbook also take care of removing the instance from Ganeti?
I see in the output "shutdown the VM", but IIRC in other tasks I had to specifically remove the VM as well
[06:52:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1080 from MW', diff saved to https://phabricator.wikimedia.org/P11682 and previous config saved to /var/cache/conftool/dbconfig/20200629-065335-marostegui.json
[06:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:14] ah yes super nice
[06:56:18] (PS1) Elukey: Remove an-launcher1001's records [dns] - https://gerrit.wikimedia.org/r/608260 (https://phabricator.wikimedia.org/T256363)
[07:00:12] (CR) Elukey: [C: +2] Remove an-launcher1001's records [dns] - https://gerrit.wikimedia.org/r/608260 (https://phabricator.wikimedia.org/T256363) (owner: Elukey)
[07:01:30] (CR) Muehlenhoff: [C: +1] "The patch looks good. Let's just ditch the broken tests." [puppet] - https://gerrit.wikimedia.org/r/607854 (owner: Dzahn)
[07:08:51] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (jcrespo) > mailman3web will have the emails That is more concerning, not because it is not doable, but because with attachments, the other database storing organization's emails on a...
[07:12:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085', diff saved to https://phabricator.wikimedia.org/P11683 and previous config saved to /var/cache/conftool/dbconfig/20200629-071236-marostegui.json
[07:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:29] !log Deploy schema change on db1085 with replication to labs T253276
[07:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:36] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[07:16:01] !log push new pfw firewall rules - T256170
[07:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:37] (CR) Alexandros Kosiaris: [C: -1] "Couple of inline technical comments (rest of the change LGTM), but my biggest one isn't technical." (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: Jeena Huneidi)
[07:37:04] (CR) Marostegui: [C: +1] mariadb: Add monitoring for lag spikes (v2) [puppet] - https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:38:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:39:04] (CR) Kormat: [C: +2] mariadb: Add monitoring for lag spikes (v2) [puppet] - https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:40:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:46:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1135 (depooled) to s1 T253217', diff saved to https://phabricator.wikimedia.org/P11684 and previous config saved to /var/cache/conftool/dbconfig/20200629-074611-marostegui.json
[07:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:17] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[07:46:57] (CR) DCausse: [C: -1] [wdqs] add a new streaming updater profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/597790 (owner: DCausse)
[07:49:56] (CR) Ayounsi: [C: +2] Allow SELECTED_PATH selection for IXP routes as well [homer/public] - https://gerrit.wikimedia.org/r/607800 (owner: Faidon Liambotis)
[07:50:22] (Merged) jenkins-bot: Allow SELECTED_PATH selection for IXP routes as well [homer/public] - https://gerrit.wikimedia.org/r/607800 (owner: Faidon Liambotis)
[07:51:55] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[07:52:02] (PS1) Marostegui: db1135: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/608265 (https://phabricator.wikimedia.org/T253217)
[07:59:06] (PS4) DCausse: [wdqs] drop updater mode config [puppet] - https://gerrit.wikimedia.org/r/602353
[07:59:08] (PS20) DCausse: [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790
[08:01:05] (CR) Marostegui: [C: +2] db1135: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/608265 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[08:02:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1135
into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11685 and previous config saved to /var/cache/conftool/dbconfig/20200629-080253-marostegui.json
[08:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:59] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[08:03:43] !log prometheus eqiad -- lvextend --resizefs --size +200G vg-ssd/prometheus-ops
[08:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:22] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 23555 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:04:52] !log add term selected-paths to policy BGP_IXP_in on all routers
[08:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:22] (CR) Marostegui: [C: +1] mariadb-backups: Move transferpy deployment to debian package [puppet] - https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:16:10] (CR) Alexandros Kosiaris: Add recommendation-api helmfile stanzas (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[08:18:30] (PS2) Majavah: multiversion: Fix 'closed-labs' reading as 'closed' for static config [mediawiki-config] - https://gerrit.wikimedia.org/r/608249 (https://phabricator.wikimedia.org/T109157) (owner: Krinkle)
[08:19:29] (CR) Jcrespo: [C: +2] mariadb-backups: Move transferpy deployment to debian package [puppet] - https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:22:16] Operations, Traffic, Patch-For-Review, User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (ema) The issue happened again on cp4025
and a few other nodes. It looks like a deadlock in `librdkafka` to me, the process is spinning on `pthread_cond_wai...
[08:24:04] (CR) Ema: [C: +2] purged: alert in case of high event lag [puppet] - https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: Ema)
[08:24:13] !log Deploy schema change on s2 codfw (lag will show up) T253276
[08:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:17] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[08:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11686 and previous config saved to /var/cache/conftool/dbconfig/20200629-082635-marostegui.json
[08:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:39] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[08:30:18] if the new purged alert (https://gerrit.wikimedia.org/r/608019) works, we should be getting some criticals
[08:30:47] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[08:33:29] !log cp1087, cp2033, cp2037, cp2039: repool after spending (way) more than 24h depooled T256444
[08:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:35] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[08:35:38] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: cluster=cache_upload instance=cp4025 job=purged site=ulsfo topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[08:36:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1135 into s1
T253217', diff saved to https://phabricator.wikimedia.org/P11687 and previous config saved to /var/cache/conftool/dbconfig/20200629-083631-marostegui.json
[08:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:36] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[08:36:38] !log cp4025: restart purged T256444
[08:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:12] (CR) Marostegui: "Have you done a quick check of those queries to see if the results are ok?" [software/spicerack] - https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: Kormat)
[08:38:36] marostegui: i'm currently looking at the UNKNOWN status on icinga for this alert
[08:38:50] kormat: ack
[08:38:52] thanks
[08:39:15] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[08:40:25] !log cp2034: restart purged T256444
[08:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:30] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[08:47:32] (CR) Vgutierrez: [C: +1] Serialize access to KafkaReader.maxts [software/purged] - https://gerrit.wikimedia.org/r/608045 (https://phabricator.wikimedia.org/T256479) (owner: Ema)
[08:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11688 and previous config saved to /var/cache/conftool/dbconfig/20200629-084827-marostegui.json
[08:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:32] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[08:52:55] (PS21) DCausse:
[wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790
[08:53:44] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2040 is CRITICAL: cluster=cache_upload instance=cp2040 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2040
[08:54:08] (CR) jerkins-bot: [V: -1] [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790 (owner: DCausse)
[08:55:48] (PS3) Kormat: install_server: Reuse partitions for dbprov* hosts [puppet] - https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768)
[08:57:22] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2040 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2040
[08:58:19] (PS1) Marostegui: db1089: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/608267 (https://phabricator.wikimedia.org/T254462)
[08:58:27] (CR) Jcrespo: [C: +1] "We should test it when new dbprov hosts are installed- we cannot test it on current dbprov hosts."
[puppet] - https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) (owner: Kormat)
[08:58:35] (CR) Kormat: "> Patch Set 2:" [software/spicerack] - https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: Kormat)
[08:58:37] (PS7) Jbond: cookbook sre.pdus: add reboot script [cookbooks] - https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890)
[08:58:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089 for InnoDB compression T254462', diff saved to https://phabricator.wikimedia.org/P11690 and previous config saved to /var/cache/conftool/dbconfig/20200629-085854-marostegui.json
[08:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:59] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[08:59:07] (CR) Marostegui: [C: +2] db1089: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/608267 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[08:59:32] (CR) Kormat: [C: +2] install_server: Reuse partitions for dbprov* hosts [puppet] - https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) (owner: Kormat)
[08:59:39] !log Compress InnoDB on db1089 (this will cause lag and will take a few days) - T254462
[08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:11] (CR) Jbond: [C: +2] cookbook sre.pdus: add reboot script [cookbooks] - https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[09:03:52] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041
[09:06:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[09:08:33] (PS7) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480)
[09:08:36] ACKNOWLEDGEMENT - Time elapsed since the last kafka event processed by purged on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} Ema purged restarted, processing backlog T256444 https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[09:08:36] ACKNOWLEDGEMENT - Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} Ema purged restarted, processing backlog T256444 https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041
[09:08:48] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: cluster=cache_upload instance=cp5002 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[09:09:19] (CR) Ema: [C: +2] Serialize access to KafkaReader.maxts [software/purged] - https://gerrit.wikimedia.org/r/608045 (https://phabricator.wikimedia.org/T256479) (owner: Ema)
[09:10:36]
(CR) Jbond: [C: +2] CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[09:11:02] !log dploying shellcheck CI https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693
[09:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:08] (PS1) Ema: Release version 0.16 [software/purged] - https://gerrit.wikimedia.org/r/608268
[09:13:05] (PS22) DCausse: [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790
[09:13:18] (CR) Ema: [C: +2] Release version 0.16 [software/purged] - https://gerrit.wikimedia.org/r/608268 (owner: Ema)
[09:15:20] (PS1) Jbond: run_ci_locally: use latests docker image [puppet] - https://gerrit.wikimedia.org/r/608269
[09:18:55] (PS1) Jbond: openstack/files/queens/admin_scripts/wmcs-prod-example: fix shellcheck [puppet] - https://gerrit.wikimedia.org/r/608270
[09:22:01] (PS1) Kormat: mariadb: Disable prolonged-lag check for non-replication cases. [puppet] - https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120)
[09:23:13] (CR) jerkins-bot: [V: -1] mariadb: Disable prolonged-lag check for non-replication cases. [puppet] - https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[09:24:08] jenkins is even meaner than marostegui
[09:24:42] I trained it
[09:25:13] it's dinging me because i have a lookup() in a profile..?
[09:27:46] (PS2) Kormat: mariadb: Disable prolonged-lag check for non-replication cases. [puppet] - https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120)
[09:28:55] (CR) jerkins-bot: [V: -1] mariadb: Disable prolonged-lag check for non-replication cases.
[puppet] - https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[09:30:11] (PS23) DCausse: [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790
[09:32:59] (PS1) Jbond: reuse-parts.sh: fix shellcheck issue [puppet] - https://gerrit.wikimedia.org/r/608273
[09:33:10] (PS1) Privacybatm: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450)
[09:33:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:34:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:42:09] (PS1) Ema: Build-depend on go 1.14 [software/purged] - https://gerrit.wikimedia.org/r/608275
[09:42:53] (CR) DCausse: "pcc output looks fine to me: https://puppet-compiler.wmflabs.org/compiler1002/23504/" [puppet] - https://gerrit.wikimedia.org/r/597790 (owner: DCausse)
[09:45:55] !log Deploy schema change on dbstore1004:3312
[09:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:58] (PS3) Kormat: mariadb: Disable prolonged-lag check for non-replication cases.
[puppet] - https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120)
[09:47:52] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:47:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:01] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (elukey) From upstream they suggest to use `--skip-atexit-teardown` (have we tried it before?): > My bet is th...
[09:51:42] (PS1) Alexandros Kosiaris: Rake: Speed up kubeyaml executions [deployment-charts] - https://gerrit.wikimedia.org/r/608276
[09:53:15] (PS1) JMeybohm: Remove deprecated and unmaintained image: envoy-tls-local-proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/608277 (https://phabricator.wikimedia.org/T253396)
[09:55:21] (CR) Alexandros Kosiaris: [C: +2] "From "SUCCESS in 3m 19s" in the parent commit, to "SUCCESS in 52s" in this commit. I am calling this a success already, merging" [deployment-charts] - https://gerrit.wikimedia.org/r/608276 (owner: Alexandros Kosiaris)
[09:56:22] (Merged) jenkins-bot: Rake: Speed up kubeyaml executions [deployment-charts] - https://gerrit.wikimedia.org/r/608276 (owner: Alexandros Kosiaris)
[09:58:31] (PS2) Jbond: idp: enable memcached on production idp servers [puppet] - https://gerrit.wikimedia.org/r/607519 (https://phabricator.wikimedia.org/T256113)
[09:58:43] (CR) Hashar: [C: +1] "Fulfilling CDanis wish at https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693/6/utils/run_ci_locally.sh#46" [puppet] - https://gerrit.wikimedia.org/r/608269 (owner: Jbond)
[09:59:11] (CR) JMeybohm: "Nice!"
[deployment-charts] - https://gerrit.wikimedia.org/r/608276 (owner: Alexandros Kosiaris)
[09:59:46] !log switch idp to memcached
[09:59:46] !log cp2040: upgrade purged to 0.16 T256479
[09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:53] T256479: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479
[10:00:08] (CR) Jbond: [C: +2] idp: enable memcached on production idp servers [puppet] - https://gerrit.wikimedia.org/r/607519 (https://phabricator.wikimedia.org/T256113) (owner: Jbond)
[10:00:13] Operations, netops: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (ayounsi) I reached out to our Juniper account rep, after a few emails they opened ER-080949 (Enhancement Request). Since Junos 15, it's possible to generate custom OIDs using python scripts: https://www....
[10:00:52] (PS1) Awight: Configure TeWü survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/608278 (https://phabricator.wikimedia.org/T253112)
[10:00:58] (PS1) Filippo Giunchedi: site: add Logstash7 capacity [puppet] - https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443)
[10:01:46] (CR) Alexandros Kosiaris: [C: +1] Remove deprecated and unmaintained image: envoy-tls-local-proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/608277 (https://phabricator.wikimedia.org/T253396) (owner: JMeybohm)
[10:03:31] (PS2) Filippo Giunchedi: site: add Logstash7 capacity [puppet] - https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443)
[10:04:26] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (akosiaris) > My bet is that when you shutdown the workers, the PyObject c structure of each python object is '...
[10:04:29] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:41] (CR) Ema: [C: +1] Remove cas-logstash from caches [puppet] - https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) (owner: Muehlenhoff)
[10:07:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:08:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:15:45] (03PS1) 10Jbond: idp: failover to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/608281 [10:18:38] (03PS1) 10Alexandros Kosiaris: ores::web: Pass skip-atexit-teardown to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/608282 (https://phabricator.wikimedia.org/T242705) [10:18:51] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) https://github.com/confluentinc/confluent-kafka-go/issues/251 [10:20:53] (03CR) 10Elukey: [C: 03+1] ores::web: Pass skip-atexit-teardown to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/608282 (https://phabricator.wikimedia.org/T242705) (owner: 10Alexandros Kosiaris) [10:21:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/608281 (owner: 10Jbond) [10:22:10] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [10:26:38] (03CR) 10Jbond: [C: 03+2] idp: failover to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/608281 (owner: 10Jbond) [10:29:48] !log restart blazegraph on wdqs1004 + depool to catchup on lag [10:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1030). 
[10:31:58] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608285 (https://phabricator.wikimedia.org/T128546) [10:32:57] (03CR) 10Kormat: "pcc tested: https://puppet-compiler.wmflabs.org/compiler1001/23505/" [puppet] - 10https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [10:34:43] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608285 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:35:23] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608285 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:41:00] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:608284| Bumping portals to master (608284)]] (duration: 00m 58s) [10:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:58] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:608284| Bumping portals to master (608284)]] (duration: 00m 57s) [10:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:22] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:46] (03CR) 10Vgutierrez: [C: 03+1] "LGTM. 
I'd merge it with puppet disabled on archiva[1001,1002] to let acme-chief issue the new "archiva-old" cert first" [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:46:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:49:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1100). [11:00:05] awight: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] I can deploy my patch. 
[11:00:19] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608278 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:01:26] (03Merged) 10jenkins-bot: Configure TeWü survey on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608278 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:01:51] TimStarling: u left a toy in the kitchen :-) w/T256395-cookie-test.php [11:02:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:07:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:07:34] (03PS1) 10Awight: Revert "Configure TeWü survey on dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608236 (https://phabricator.wikimedia.org/T253112) [11:07:45] (03CR) 10Bartosz Dziewoński: [C: 03+1] Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [11:07:51] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608236 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:08:09] (03PS3) 10Muehlenhoff: Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) [11:08:41] (03Merged) 10jenkins-bot: Revert "Configure TeWü survey on dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608236 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) 
[11:08:53] !log Deploy schema change on db1095:3312 (lag will show up) [11:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:46] (03CR) 10Muehlenhoff: Handle CAS war updates (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [11:12:37] (03PS4) 10Bartosz Dziewoński: Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 [11:15:58] !log EU BACON cooked [11:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:19:22] (03PS1) 10Jbond: idp: fail back to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/608291 [11:20:00] (03CR) 10Jbond: [C: 03+2] idp: fail back to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/608291 (owner: 10Jbond) [11:24:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:34:02] (03PS1) 10CDanis: enable sampling on additional eqiad fpcs [homer/public] - 10https://gerrit.wikimedia.org/r/608293 (https://phabricator.wikimedia.org/T256512) [11:36:49] (03PS1) 10CDanis: depool eqiad for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/608294 (https://phabricator.wikimedia.org/T256512) [11:38:40] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/608293 (https://phabricator.wikimedia.org/T256512) (owner: 10CDanis) [11:40:33] PROBLEM - Device not healthy -SMART- on cp3053 is CRITICAL: cluster=cache_upload device=nvme0 instance=cp3053 job=node site=esams https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3053&var-datasource=esams+prometheus/ops [11:40:33] (03CR) 10CDanis: [C: 03+2] depool eqiad for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/608294 (https://phabricator.wikimedia.org/T256512) (owner: 10CDanis) [11:41:20] !log depool eqiad T256512 [11:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:59] (03PS1) 10CDanis: Revert "depool eqiad for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/608238 (https://phabricator.wikimedia.org/T256512) [11:45:50] (03CR) 10RhinosF1: [C: 03+1] Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [11:46:07] (03CR) 10CDanis: [C: 03+2] enable sampling on additional eqiad fpcs [homer/public] - 10https://gerrit.wikimedia.org/r/608293 (https://phabricator.wikimedia.org/T256512) (owner: 10CDanis) [11:46:33] (03Merged) 10jenkins-bot: enable sampling on additional eqiad fpcs [homer/public] - 10https://gerrit.wikimedia.org/r/608293 (https://phabricator.wikimedia.org/T256512) (owner: 10CDanis) [11:46:35] (03PS1) 10Hashar: zuul: stop prefixing report with the job name [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) [11:49:23] (03PS1) 10Kosta Harlan: GrowthExperiments: Remove overrides to welcome survey privacy policy URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608297 (https://phabricator.wikimedia.org/T252572) [11:53:24] James_F: is the lockeddown dblist patches safe for deploy now? 
[11:54:45] (03PS1) 10Ssingh: wikidough: update firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/608299 (https://phabricator.wikimedia.org/T252132) [11:57:55] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23508/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/608299 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:58:42] !log deployed I132075ee on cr2-eqiad [11:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:50] !log deployed I132075ee on cr2-eqiad T256512 [11:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) (owner: 10Arturo Borrero Gonzalez) [11:59:59] !log deployed I132075ee on cr1-eqiad T256512 [12:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:03] (03PS3) 10Alexandros Kosiaris: lvs: Add new proton TLS service [puppet] - 10https://gerrit.wikimedia.org/r/607531 (https://phabricator.wikimedia.org/T225680) [12:01:05] (03PS3) 10Alexandros Kosiaris: lvs: Switch proton to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) [12:01:07] (03PS3) 10Alexandros Kosiaris: lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) [12:01:22] (03PS3) 10Alexandros Kosiaris: lvs: Switch proton to production [puppet] - 10https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) [12:01:24] (03PS3) 10Alexandros Kosiaris: proton: Switch dev restbase to talk to TLS proton [puppet] - 10https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) [12:01:26] (03PS3) 10Alexandros Kosiaris: proton: Switch restbase 
production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) [12:02:56] (03CR) 10CDanis: [C: 03+2] Revert "depool eqiad for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/608238 (https://phabricator.wikimedia.org/T256512) (owner: 10CDanis) [12:03:24] !log re-pool eqiad T256512 [12:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:27] (03PS1) 10Jbond: icinga: add ldap-icinga.wikimedia.org CNAME [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) [12:06:23] (03PS2) 10Jbond: icinga: add ldap-icinga.wikimedia.org CNAME [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) [12:07:29] (03CR) 10Kormat: [C: 03+1] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/608273 (owner: 10Jbond) [12:09:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/608282 (https://phabricator.wikimedia.org/T242705) (owner: 10Alexandros Kosiaris) [12:10:34] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/608273 (owner: 10Jbond) [12:10:37] (03CR) 10Jbond: [C: 03+2] reuse-parts.sh: fix shellcheck issue [puppet] - 10https://gerrit.wikimedia.org/r/608273 (owner: 10Jbond) [12:11:36] (03CR) 10Hashar: "Paladox mentioned a Gerrit plugin that would tentatively address our use case: https://github.com/dburm/pg-test-result-plugin 😊" [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar) [12:19:13] (03PS1) 10Marostegui: mariadb: Reimage db2096 (codfw x1 master) to Buster [puppet] - 10https://gerrit.wikimedia.org/r/608304 (https://phabricator.wikimedia.org/T254871) [12:19:23] (03CR) 10Hashar: [C: 04-1] "On hold for now" [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar) [12:19:55] (03CR) 10Marostegui: [C: 03+2] mariadb: 
Reimage db2096 (codfw x1 master) to Buster [puppet] - 10https://gerrit.wikimedia.org/r/608304 (https://phabricator.wikimedia.org/T254871) (owner: 10Marostegui) [12:20:39] !log Stop MySQL on db2096 (codfw x1 master) for reimage T254871 [12:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:43] T254871: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 [12:21:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:25:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:27:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:44] (03PS3) 10Filippo Giunchedi: site: add Logstash7 capacity [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) [12:30:04] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/23510/" [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi) [12:32:50] !log deleted all tags for 
docker-registry.wikimedia.org/envoy-tls-local-proxy from docker registry - T253396 [12:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:54] T253396: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 [12:34:25] (03PS1) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:36:17] (03PS1) 10Kormat: install_server: Remove no-srv-format.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) [12:38:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:38:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [12:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:38:37] (03PS2) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:38:39] (03PS1) 10Jbond: role::alerting_host: use the same SSL cert for cas-icinga and icinga [puppet] - 10https://gerrit.wikimedia.org/r/608307 [12:41:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:49] (03CR) 10Marostegui: [C: 03+1] mariadb: Disable prolonged-lag check for non-replication cases. 
[puppet] - 10https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [12:47:12] (03CR) 10Kormat: [C: 03+2] mariadb: Disable prolonged-lag check for non-replication cases. [puppet] - 10https://gerrit.wikimedia.org/r/608271 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [12:48:25] (03PS3) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:48:50] (03CR) 10Muehlenhoff: "Good to review, ran some tests on idp-test2001" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [12:51:31] (03CR) 10Ottomata: [C: 03+1] labs: Update eventgate placeholders in Beta Cluster to not use deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608223 (https://phabricator.wikimedia.org/T198673) (owner: 10Krinkle) [12:52:12] (03PS2) 10Jbond: role::alerting_host: use the same SSL cert for cas-icinga and icinga [puppet] - 10https://gerrit.wikimedia.org/r/608307 [12:52:33] (03CR) 10Elukey: [C: 03+2] Move archiva.wikimedia.org from archiva1001 to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [12:53:20] (03PS4) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:53:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [12:54:54] (03PS3) 10Jbond: role::alerting_host: use the same SSL cert for cas-icinga and icinga [puppet] - 10https://gerrit.wikimedia.org/r/608307 [12:55:51] (03PS2) 10Kormat: install_server: Remove no-srv-format.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608306 
(https://phabricator.wikimedia.org/T251768) [12:56:09] (03PS5) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:56:29] (03PS4) 10Jbond: role::alerting_host: use the same SSL cert for cas-icinga and icinga [puppet] - 10https://gerrit.wikimedia.org/r/608307 [12:56:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085', diff saved to https://phabricator.wikimedia.org/P11692 and previous config saved to /var/cache/conftool/dbconfig/20200629-125630-marostegui.json [12:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:40] (03PS6) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [12:57:06] (03CR) 10Vgutierrez: "I'd split this change in two:" [puppet] - 10https://gerrit.wikimedia.org/r/608307 (owner: 10Jbond) [12:57:26] (03PS1) 10Elukey: Update CNAMEs for archiva [dns] - 10https://gerrit.wikimedia.org/r/608308 (https://phabricator.wikimedia.org/T252767) [12:58:05] (03CR) 10Elukey: [C: 03+2] Update CNAMEs for archiva [dns] - 10https://gerrit.wikimedia.org/r/608308 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [12:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P11693 and previous config saved to /var/cache/conftool/dbconfig/20200629-125824-marostegui.json [12:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:42] (03PS1) 10Michael Große: Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) [13:00:10] 10Operations, 10Wikimedia-Mailing-lists: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (10Tgr) It would be great if 
the gravatar proxy were a service available for all kinds of tools, as there are plenty of potential applications ({T191183} for example). [13:00:24] !log move archiva.wikimedia.org to archiva1002 (new buster vm); create archiva-old.wikimedia.org to archiva1001 [13:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:05] PROBLEM - HTTPS on archiva1001 is CRITICAL: SSL CRITICAL - failed to verify archiva.wikimedia.org against archiva-old.wikimedia.org https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:02:16] !log test pfw3-codfw uplinks failover [13:02:17] yep this is me --^ [13:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:53] PROBLEM - HTTPS on archiva1002 is CRITICAL: SSL CRITICAL - failed to verify archiva-new.wikimedia.org against archiva.wikimedia.org https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:05:07] I can't push new commits to gerrit, is this a known issue? [13:05:22] " ! [remote rejected] HEAD -> refs/publish/master/sticky (prohibited by Gerrit: update for creating new commit object not permitted)" [13:05:33] edsanders: your git-review is outdated, update it [13:05:37] (03PS5) 10Jbond: role::alerting_host: add cas-icinga sni to icinga.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/608307 [13:05:39] (03PS1) 10Jbond: role::alerting_host: update the cas-icinga vhost to use the icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/608312 [13:06:11] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/608307 (owner: 10Jbond) [13:06:57] (03PS7) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [13:08:15] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [13:08:20] Majavah: what version do I need? [13:08:21] RECOVERY - HTTPS on archiva1002 is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2020-08-17 05:00:12 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:08:47] edsanders: 1.27 or newer [13:08:48] edsanders: 1.27 or newer [13:08:53] I have 1.27 [13:09:14] oh, wait I have 1.26, sorry [13:09:27] (03PS8) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [13:09:36] I think I have two versions [13:09:57] edsanders: the correct one, and the one you use? :D [13:10:07] yeah - i have 1.27 from apt [13:13:09] thanks, works now [13:16:01] (03PS6) 10Jbond: role::alerting_host: add additional SNI's to icinga.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/608307 [13:16:28] (03PS2) 10Jbond: role::alerting_host: update the cas-icinga vhost to use the icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/608312 [13:16:41] (03PS9) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [13:18:53] (03PS10) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [13:21:08] RECOVERY - HTTPS on archiva1001 is OK: SSL OK - Certificate archiva-old.wikimedia.org valid until 2020-09-27 11:54:55 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:21:22] \o/ [13:23:15] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23521/" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [13:24:20] (03PS2) 
10Filippo Giunchedi: thanos: set consistency-delay on store [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) [13:24:22] (03PS1) 10Filippo Giunchedi: monitoring: switch to new names for global availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) [13:25:45] (03CR) 10Ladsgroup: [C: 03+1] Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [13:26:08] elukey: :D [13:26:13] (03CR) 10Paladox: planet: replace system/user group with systemd-sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [13:26:35] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [13:34:19] (03Abandoned) 10JMeybohm: Update envoy, add ability to define an idle timeout [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580906 (owner: 10Giuseppe Lavagetto) [13:35:32] !log depool cp3053 due to nvme hardware issues [13:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:47] jouncebot: next [13:38:47] In 3 hour(s) and 21 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1700) [13:38:56] OK, will do some minor config deploys. 
[13:39:04] (03PS3) 10Jforrester: dblists: Introduce lockeddown, to replace nonbetafeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606971 [13:40:23] 10Operations, 10serviceops, 10CPT Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Mholloway) [13:40:35] (03CR) 10Jforrester: [C: 03+2] dblists: Introduce lockeddown, to replace nonbetafeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606971 (owner: 10Jforrester) [13:40:53] (03CR) 10Muehlenhoff: planet: replace system/user group with systemd-sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [13:41:24] (03Merged) 10jenkins-bot: dblists: Introduce lockeddown, to replace nonbetafeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606971 (owner: 10Jforrester) [13:41:33] (03PS3) 10Jforrester: Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606972 [13:43:01] !log jforrester@deploy1001 Synchronized dblists/lockeddown.dblist: Add lockddown dblist (unused as yet) (duration: 00m 59s) [13:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:14] (03CR) 10Jforrester: [C: 03+2] Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606972 (owner: 10Jforrester) [13:44:06] (03Merged) 10jenkins-bot: Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606972 (owner: 10Jforrester) [13:47:21] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10Vgutierrez) [13:47:39] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Add 'lockeddown' dblist to production reads (duration: 00m 57s) [13:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:59] 10Operations, 10ops-esams, 
10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10Vgutierrez) p:05Triage→03Medium [13:48:30] (03PS3) 10Jforrester: dblists: Drop nonbetafeatures, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606973 [13:49:04] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch uses from nonbetafeatures to lockeddown (duration: 00m 57s) [13:49:07] ACKNOWLEDGEMENT - Device not healthy -SMART- on cp3053 is CRITICAL: cluster=cache_upload device=nvme0 instance=cp3053 job=node site=esams Vgutierrez T256632 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3053&var-datasource=esams+prometheus/ops [13:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Drop 'nonbetafeatures' dblist from production reads (duration: 00m 56s) [13:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:45] (03CR) 10Jforrester: [C: 03+2] dblists: Drop nonbetafeatures, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606973 (owner: 10Jforrester) [13:51:55] (03PS5) 10Jforrester: Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [13:52:32] (03Merged) 10jenkins-bot: dblists: Drop nonbetafeatures, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606973 (owner: 10Jforrester) [13:52:36] (03CR) 10Krinkle: [C: 03+1] "LGTM, diffs don't lie" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [13:54:04] !log hnowlan@deploy1001 Started deploy [restbase/deploy@ce5177e]: Enable gom wiktionary [13:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:19] !log jforrester@deploy1001 Synchronized dblists/: Drop nonbetafeatures 
dblist, unused (duration: 00m 57s) [13:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:03] (03CR) 10Jforrester: [C: 03+2] Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [13:56:46] (03Merged) 10jenkins-bot: Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [13:57:04] PROBLEM - Restbase root url on restbase1016 is CRITICAL: connect to address 10.64.0.31 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [13:58:22] (03PS3) 10Jforrester: multiversion: Fix 'closed-labs' reading as 'closed' for static config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608249 (https://phabricator.wikimedia.org/T109157) (owner: 10Krinkle) [14:00:44] (03CR) 10Jforrester: [C: 03+2] multiversion: Fix 'closed-labs' reading as 'closed' for static config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608249 (https://phabricator.wikimedia.org/T109157) (owner: 10Krinkle) [14:01:28] (03Merged) 10jenkins-bot: multiversion: Fix 'closed-labs' reading as 'closed' for static config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608249 (https://phabricator.wikimedia.org/T109157) (owner: 10Krinkle) [14:02:57] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Fix 'closed-labs' reading as 'closed' for static config (duration: 00m 56s) [14:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:12] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:32] (03CR) 10Jforrester: "> Patch Set 2: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594216 (https://phabricator.wikimedia.org/T251715) (owner: 10Jforrester) [14:04:33] (03PS2) 10Jforrester: Revert "dblists: Remove "do not modify" note from all.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594214 [14:04:37] (03PS3) 10Jforrester: buildDBLists: Remove circular dependency on all.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594216 (https://phabricator.wikimedia.org/T251715) [14:05:50] (03CR) 10Filippo Giunchedi: "Overall LGTM, the vhost configuration though has significant diffs, e.g." [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [14:07:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [14:08:01] (03CR) 10Jforrester: "Hmm. 
This was intentionally enabled on a Beta-only wiki just in case we accidentally enabled it in actual production, but I think this is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608246 (https://phabricator.wikimedia.org/T198673) (owner: 10Krinkle) [14:08:25] (03PS1) 10Marostegui: db2096: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/608327 (https://phabricator.wikimedia.org/T254871) [14:09:06] (03CR) 10Marostegui: [C: 03+2] db2096: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/608327 (https://phabricator.wikimedia.org/T254871) (owner: 10Marostegui) [14:11:36] (03CR) 10Jbond: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [14:14:47] (03CR) 10Jcrespo: "I am ok with this, but I wonder how we can document this better so other people don't ping us saying "there is a bug on netboot.cnf, the r" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [14:14:47] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@ce5177e]: Enable gom wiktionary (duration: 20m 44s) [14:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:07] !log hnowlan@deploy1001 Started deploy [restbase/deploy@900bcf6]: Enable gom wiktionary [14:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [14:16:31] (03PS2) 10JMeybohm: Add patches for swift auth and bind interface [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) [14:17:42] RECOVERY - Restbase root url on restbase1016 is OK: HTTP OK: HTTP/1.1 200 - 16515 bytes in 0.006 second response time 
https://wikitech.wikimedia.org/wiki/RESTBase [14:17:58] (03PS2) 10Lucas Werkmeister (WMDE): Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [14:18:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [14:19:21] (03Merged) 10jenkins-bot: Add "E" as an alias of EntitySchema namespace on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608309 (https://phabricator.wikimedia.org/T245529) (owner: 10Michael Große) [14:20:07] (Sorry, prod now clear, forgot to say.) [14:20:32] !log upload purged 0.16 to apt.wm.org T256479 [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:45] T256479: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 [14:21:24] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [14:22:35] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:608309|Add "E" as an alias of EntitySchema namespace on wikidata (T245529)]] (duration: 00m 57s) [14:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:40] T245529: Namespace alias for EntitySchema - https://phabricator.wikimedia.org/T245529 [14:28:00] (03CR) 10Kormat: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [14:28:41] !log A:cp rolling purged upgrade to 0.16 T256479 [14:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:46] T256479: purged crashes with "fatal error: concurrent map read and 
map write" - https://phabricator.wikimedia.org/T256479 [14:30:15] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [14:31:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2041 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041 [14:31:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Add new proton TLS service [puppet] - 10https://gerrit.wikimedia.org/r/607531 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [14:32:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at https://puppet-compiler.wmflabs.org/compiler1002/23516/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/607531 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [14:32:45] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, 10serviceops: Increased "Allowed memory size exhausted" exceptions from MediaWiki since 2020-06-25 ~16:00 - https://phabricator.wikimedia.org/T256459 (10Jdforrester-WMF) AIUI, previously all the memory-exhausted errors from Parsoid w... 
[14:33:56] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@900bcf6]: Enable gom wiktionary (duration: 17m 49s) [14:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] !log hnowlan@deploy1001 Started deploy [restbase/deploy@900bcf6]: Enable gom wiktionary [14:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:04] (03CR) 10Kormat: "+moritz to have Opinions :)" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [14:37:44] (03PS1) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [14:37:55] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Release-Engineering-Team (Deployment services), and 2 others: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10Aklapper) @thcipriani: A #good_first_task is a self-contain... [14:37:57] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi) [14:37:59] (03CR) 10Ppchelko: [C: 03+1] Add HTTP proxy to MediaModeration. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [14:38:24] (03PS2) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [14:40:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041 [14:40:25] (03CR) 10Vgutierrez: [C: 03+1] role::alerting_host: add additional SNI's to icinga.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/608307 (owner: 10Jbond) [14:43:21] (03PS11) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [14:43:29] (03CR) 10Muehlenhoff: "We could add a stub partman config like "manual-setup.cfg" which only has a comment that a server with this kind of recipe gets installed " [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [14:44:59] (03CR) 10Jbond: [C: 03+2] role::alerting_host: add additional SNI's to icinga.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/608307 (owner: 10Jbond) [14:48:33] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@900bcf6]: Enable gom wiktionary (duration: 13m 40s) [14:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:38] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:50:17] !log hnowlan@deploy1001 Started deploy [restbase/deploy@900bcf6]: Redeploy to fix transient error in gom wiktionary deploy [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:23] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@900bcf6]: Redeploy to fix transient error in gom wiktionary deploy (duration: 00m 06s) [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:32] PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:02] (03CR) 10Krinkle: [C: 03+1] "Yep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594216 (https://phabricator.wikimedia.org/T251715) (owner: 10Jforrester) [14:59:47] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:02:52] that's ^ me, sorry, fixed [15:04:06] RECOVERY - Host ms-be2051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [15:05:10] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:05:12] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:07:08] RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [15:07:22] 10Operations, 10Puppet: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10hashar) See also a prior error at T247652#6255667 and my response below. The initial puppet provisioning installs the Debian package which spawn the process with the stock configuration, puppet t... 
[15:08:02] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (10Jseddon) [15:10:17] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10hashar) 05Open→03Resolved a:03ema This was a single use task. The rest will be done as part of deploying the scap configuration on the deployment servers which is parent task T256005... [15:11:55] (03CR) 10EBernhardson: [C: 03+1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [15:12:52] (03CR) 10Dbarratt: [C: 03+1] Require editinterface to edit NS_CONFIG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608212 (https://phabricator.wikimedia.org/T256278) (owner: 10Ammarpad) [15:12:59] (03CR) 10Hashar: [C: 03+1] "The repository scap configuration has been merged ( https://gerrit.wikimedia.org/r/c/integration/docroot/+/607055 )" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [15:13:49] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) I open a ticket # 1674336 with CY1 to disconnect and connect new PDU's tomorrow at 9:30am CT [15:19:22] PROBLEM - Host ms-be2053 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:03] !log repool wdqs1004 - catched up on lag [15:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02205 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:24:02] RECOVERY - Host ms-be2053 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:25:04] (03CR) 10Jbond: "check experimental" [puppet] 
- 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [15:25:50] RECOVERY - Host ms-be2053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms [15:26:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:27:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:28:18] aren't fatals more common in the last 4 days^? [15:29:06] PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:10] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, 10serviceops: Increased "Allowed memory size exhausted" exceptions from MediaWiki since 2020-06-25 ~16:00 - https://phabricator.wikimedia.org/T256459 (10ssastry) [15:31:14] 10Operations, 10Parsoid, 10serviceops, 10User-brennen, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) [15:31:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P11696 and previous config saved to /var/cache/conftool/dbconfig/20200629-153140-marostegui.json [15:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:16] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, 10serviceops: Increased "Allowed memory size exhausted" exceptions from MediaWiki since 2020-06-25 ~16:00 - https://phabricator.wikimedia.org/T256459 (10ssastry) I merged this into the other task since, as @Jdforrester-WMF noted, thi... 
[15:34:12] RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:35:18] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003781 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:37:14] RECOVERY - Host ms-be2055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [15:37:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:37:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:20] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 3 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` ganeti1006.eqiad.wmnet ` [15:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:43:31] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:34] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 4 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` ganeti1006.eqiad.wmnet ` [15:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:26] 10Operations, 10ops-codfw: Return asw-c8-codfw to spares - https://phabricator.wikimedia.org/T256498 (10Papaul) 05Open→03Resolved a:03Papaul Complete [15:46:22] 10Operations, 10ops-codfw, 10SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - 
https://phabricator.wikimedia.org/T256436 (10Papaul) 05Open→03Resolved An ILO reset fixed the issue [15:47:52] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10ssingh) Hi @Edtadros. Thanks for providing the information. To continue, we will require approval from your manager (already added on the task) and s... [15:48:15] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Radar: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10Ottomata) [15:55:56] 10Operations, 10Traffic: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10BBlack) p:05Triage→03Low [15:56:38] (03PS1) 10BBlack: nvme formatting was missing for new codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/608425 (https://phabricator.wikimedia.org/T256655) [15:57:19] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10BBlack) [15:58:01] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10BBlack) [16:04:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608299 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:19:06] PROBLEM - Disk space on stat1006 is CRITICAL: DISK CRITICAL - free space: / 1142 MB (1% inode=95%): /tmp 1142 MB (1% inode=95%): /var/tmp 1142 MB (1% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1006&var-datasource=eqiad+prometheus/ops [16:21:16] this is a big set of tmp files --^ [16:24:57] removed the file, should recover soon [16:26:24] 🐱 [16:26:31] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - 
https://phabricator.wikimedia.org/T256435 (10elukey) To speed up the approval procedure, please also add details about why access is needed etc.. (a bit more verbose than Data QA in production pl... [16:30:18] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [16:36:31] 10Operations, 10LDAP-Access-Requests: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10thcipriani) [16:37:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Jclark-ctr) host rack. Switchport. asset tag cloudvirt1031 C8 7 WMF4817 cloudvirt1032 C8 8 WM... [16:38:45] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10RobH) [16:38:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [16:39:56] RECOVERY - Disk space on stat1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1006&var-datasource=eqiad+prometheus/ops [16:40:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Jclark-ctr) [16:40:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Jclark-ctr) host rack. switch port asset tag cloudcephosd1004 C8 22 WMF5103 cl... 
[16:41:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [16:43:31] 10Operations, 10Research, 10observability, 10Patch-For-Review: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10Dzahn) [16:45:49] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10Vgutierrez) we have scheduled a system reboot of these boxes.. I'll sync that with the "re-format" of the NVMe devices. [16:47:02] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10BBlack) [16:51:52] 10Operations, 10ops-eqiad, 10DC-Ops: apply hostname labels to bast1002/WMF4749 - https://phabricator.wikimedia.org/T186625 (10Jclark-ctr) 05Open→03Resolved host labeled resolving tast [16:51:54] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623 (10Jclark-ctr) [16:54:42] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2041 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041 [16:56:16] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2033 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [16:57:34] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10elukey) There may be another solution, namely creating a new apt component to hold 1.4.x and deploy it selectively where needed (as opposed to roll it out... [17:00:04] gehel and onimisionipe: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1700). [17:06:01] (03CR) 10Dzahn: [C: 03+2] releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [17:06:48] (03PS1) 10Ssingh: admin: add Dan Shick to ldap_only_users group (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/608432 (https://phabricator.wikimedia.org/T254442) [17:17:11] (03PS1) 10Dave Pifke: [WIP] webperf: Scrape coal exporter [puppet] - 10https://gerrit.wikimedia.org/r/608434 (https://phabricator.wikimedia.org/T225740) [17:19:16] (03CR) 10Dzahn: [C: 04-1] admin: add Dan Shick to ldap_only_users group (WMDE) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608432 (https://phabricator.wikimedia.org/T254442) (owner: 10Ssingh) [17:20:24] (03PS2) 10Ssingh: admin: add Dan Shick to ldap_only_users group (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/608432 (https://phabricator.wikimedia.org/T254442) [17:21:27] (03CR) 10Dzahn: [C: 03+1] admin: add Dan Shick to ldap_only_users group (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/608432 (https://phabricator.wikimedia.org/T254442) (owner: 10Ssingh) [17:22:44] (03CR) 10Ssingh: [C: 03+2] admin: add Dan Shick to ldap_only_users group (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/608432 
(https://phabricator.wikimedia.org/T254442) (owner: 10Ssingh) [17:30:34] !log LDAP - added datn to groups wmde, nda - T254442 [17:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:40] T254442: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 [17:35:59] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Mayakp.wiki) [17:38:02] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10ssingh) Hi @danshick-wmde: You have been added to the "wmde" and "nda" groups and should be able to access Superset now. > $ /usr/bin/ldapse... [17:47:42] (03CR) 10Dave Pifke: "I reverted this in beta when I was done testing (thus no tag)." [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [17:51:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) [17:59:11] jouncebot: next [17:59:11] In 0 hour(s) and 0 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1800) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T1800). [18:00:04] ProcReader and Cicalese: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:18] I'm here. [18:00:21] hi [18:00:53] I can deploy today! [18:01:13] Thank you! 
[18:01:46] (03PS3) 10Urbanecm: Setup rollbacker and mover on lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608247 (https://phabricator.wikimedia.org/T256109) (owner: 10ProcrastinatingReader) [18:02:18] (03CR) 10Urbanecm: [C: 03+2] Setup rollbacker and mover on lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608247 (https://phabricator.wikimedia.org/T256109) (owner: 10ProcrastinatingReader) [18:03:04] CindyCicaleseWMF: can you test yours? [18:03:10] (once it's ready, not now) [18:03:15] yes, absolutely [18:03:26] Hey ProcReader [18:03:37] hey [18:06:46] (03Merged) 10jenkins-bot: Setup rollbacker and mover on lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608247 (https://phabricator.wikimedia.org/T256109) (owner: 10ProcrastinatingReader) [18:07:48] ProcReader: do you have WikimediaDebug Ready? [18:07:59] (03PS3) 10Urbanecm: Add HTTP proxy to MediaModeration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:08:07] yeah, mwdebug1002? [18:08:21] (03CR) 10Urbanecm: [C: 03+2] Add HTTP proxy to MediaModeration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:08:42] ProcReader: mwdebug1001 :) [18:08:45] (it's there, now) [18:09:18] ah. yeah, looks good [18:09:21] (03Merged) 10jenkins-bot: Add HTTP proxy to MediaModeration. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:09:44] thanks, syncing [18:10:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: aeb7b52: Setup rollbacker and mover on lijwiki (T256109) (duration: 02m 05s) [18:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:26] T256109: Enable new rights groups on Ligurian Wikipedia - https://phabricator.wikimedia.org/T256109 [18:10:46] CindyCicaleseWMF: your patch is at mwdebug1001 :) [18:10:56] excellent - testing [18:11:01] Great ProcReader [18:11:11] looks good, thanks Urbanecm :) [18:11:18] no problem :) [18:11:24] and RhinosF1 [18:11:36] :) [18:12:18] looks good! [18:13:44] thanks, syncing! [18:14:03] excellent - thank you! [18:14:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:15:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c86fcd4: Add HTTP proxy to MediaModeration (T247943) (duration: 00m 58s) [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943 [18:15:39] CindyCicaleseWMF: here you go :) [18:15:45] !log Morning B&C done [18:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:07] \o/ Thank you @Urbanecm! [18:16:12] happy to help! 
[18:18:42] (03PS1) 10Bstorm: wikireplicas: record grant for wikiscan [puppet] - 10https://gerrit.wikimedia.org/r/608438 (https://phabricator.wikimedia.org/T227462) [18:21:44] (03CR) 10Bstorm: [C: 03+2] wikireplicas: record grant for wikiscan [puppet] - 10https://gerrit.wikimedia.org/r/608438 (https://phabricator.wikimedia.org/T227462) (owner: 10Bstorm) [18:28:30] PROBLEM - IPMI Sensor Status on logstash2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:29:40] 10Operations, 10Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10ssingh) Hi @KuboF: The list `vmeo-estraro` has been created and you should have received an email. The list info page is at https://lists.wikimedia.... [18:38:28] 10Operations, 10Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10Dzahn) > Please, make the list hidden, so it does not appear in the public listing on https://lists.wikimedia.org/mailman/listinfo Please avoid that. It... [18:48:55] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list for ILAE English Wikipedia project - https://phabricator.wikimedia.org/T256193 (10ssingh) 05Open→03Resolved a:03ssingh Hi @Diptanshu.D: The list `WP.Epilepsy` has been created and you should have received an email. The list info pa... [18:50:01] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10ssingh) 05Open→03Resolved Marking this as resolved. Please feel free to reopen if there are any issues or questions. Thanks! 
[18:50:20] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10ssingh) a:05KFrancis→03ssingh [18:51:43] (03PS1) 10ProcrastinatingReader: Add arbcom group to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) [18:52:53] (03PS1) 10Bstorm: wikireplicas: record connection increase for petscan (catscan2) [puppet] - 10https://gerrit.wikimedia.org/r/608441 (https://phabricator.wikimedia.org/T255730) [18:55:34] !log test mtail rc35+wmf2 on cp5001 - T255776 [18:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:38] T255776: mtail "syscall spam" / high cpu usage on logstash1023 - https://phabricator.wikimedia.org/T255776 [18:58:10] (03CR) 10Bstorm: [C: 03+2] wikireplicas: record connection increase for petscan (catscan2) [puppet] - 10https://gerrit.wikimedia.org/r/608441 (https://phabricator.wikimedia.org/T255730) (owner: 10Bstorm) [18:58:13] 10Operations, 10LDAP-Access-Requests: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10ssingh) a:03ssingh [19:05:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:08:34] (03PS1) 10Ottomata: Refine: Only filter for allowed domains from external EventLogging data [puppet] - 10https://gerrit.wikimedia.org/r/608443 [19:23:16] (03CR) 10Joal: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608443 (owner: 10Ottomata) [19:26:32] (03PS1) 10Ssingh: admin: add Ahmon Dancy to ldap_only_users group [puppet] - 10https://gerrit.wikimedia.org/r/608449 (https://phabricator.wikimedia.org/T256658) [19:32:39] (03PS1) 10Cwhite: hiera: install mtail from 
component in codfw and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/608450 (https://phabricator.wikimedia.org/T255776) [19:36:47] (03PS1) 10Hashar: releases: remove phpunit and php-curl [puppet] - 10https://gerrit.wikimedia.org/r/608452 (https://phabricator.wikimedia.org/T256164) [19:37:52] (03CR) 10Hashar: "We already cleaned out composer and the php packages, but there are a couple more in the releases::init module ;)" [puppet] - 10https://gerrit.wikimedia.org/r/608452 (https://phabricator.wikimedia.org/T256164) (owner: 10Hashar) [19:38:12] (03CR) 10Dzahn: [C: 03+1] admin: add Ahmon Dancy to ldap_only_users group [puppet] - 10https://gerrit.wikimedia.org/r/608449 (https://phabricator.wikimedia.org/T256658) (owner: 10Ssingh) [19:40:40] (03CR) 10Ssingh: [C: 03+2] admin: add Ahmon Dancy to ldap_only_users group [puppet] - 10https://gerrit.wikimedia.org/r/608449 (https://phabricator.wikimedia.org/T256658) (owner: 10Ssingh) [19:41:32] (03CR) 10RLazarus: "Friendly ping -- I'll go ahead with this soon if there are no objections." [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [19:42:58] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10ssingh) 05Open→03Resolved Hi @thcipriani. This is completed and marking it as resolved. Let me know if there are any other concerns -- feel free to... 
[19:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3315 T256679', diff saved to https://phabricator.wikimedia.org/P11698 and previous config saved to /var/cache/conftool/dbconfig/20200629-194327-marostegui.json [19:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:32] T256679: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 [19:47:58] (03CR) 10EBernhardson: Default PYSPARK_PYTHON to exact versioned python executable used on driver. (031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [19:49:05] (03CR) 10Herron: "Is "investigate making cas usernames case sensitive" https://phabricator.wikimedia.org/T256656 a blocker to this?" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [19:53:17] (03CR) 10Dzahn: [C: 03+2] releases: remove phpunit and php-curl [puppet] - 10https://gerrit.wikimedia.org/r/608452 (https://phabricator.wikimedia.org/T256164) (owner: 10Hashar) [19:53:48] mutante: some stuff in the profile, some other in the module :] [19:55:01] yea, assuming it was still specific to just mediawiki [19:58:19] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox): Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (10Aklapper) [20:00:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1144:3315 T256679', diff saved to https://phabricator.wikimedia.org/P11699 and previous config saved to /var/cache/conftool/dbconfig/20200629-200002-marostegui.json [20:00:04] halfak and accraze: May I have your attention please! Services – Graphoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T2000) [20:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:07] T256679: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 [20:01:44] 10Operations, 10DBA, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10herron) >>! In T256538#6262958, @Marostegui wrote: > @herron any idea how big these DBs can be and how many writes we'd be expecting? > Which grants would be needed? > > I would assum... [20:05:03] dpifke: Would it make sense to say "admins who have shell on servers with xhgui should also know the mysql password"? Or is that not normally needed and really just a one-off for the new database? [20:06:35] (03PS11) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [20:06:37] (03CR) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602171 (owner: 10EBernhardson) [20:10:10] (03CR) 10Herron: [C: 03+1] monitoring: switch to new names for global availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [20:14:49] (03CR) 10Herron: [C: 03+1] "LGTM from cursory check" [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [20:15:24] (03CR) 10EBernhardson: sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [20:16:39] (03CR) 10Herron: [C: 03+1] hiera: install mtail from component in codfw and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/608450 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [20:17:39] (03CR) 10Herron: "removing from my queue, feel free to re-add when ready to resume!" 
[puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [20:19:22] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-da [20:19:22] ometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:20:58] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:21:00] (03PS1) 10Dzahn: xhgui: let perf-team admins have access to xhgui DB (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/608456 (https://phabricator.wikimedia.org/T254795) [20:36:22] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:36:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:39:22] 👀 [20:42:38] (03PS12) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [20:43:41] mutante: Once change 603550 is merged, anyone with root on xhguiX001 will have access to the password. 
by way of being able to read it from /etc/xhgui/config.php, and anyone with deploy access will be able to read it from PrivateSettings.php on deployX001. [20:44:05] The only time I can think of that we'd be accessing the DB directly is for debugging, so this is mostly a one-off. [20:44:35] (03CR) 10Jeena Huneidi: "Thank you for the comments. I will take them into account while moving this change to the integration/config repo." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [20:45:02] (03Abandoned) 10Jeena Huneidi: Add Cassandra image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [20:45:13] (03PS1) 10Ryan Kemper: Temporarily disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) [20:45:27] dpifke: fair enough. then just check your home dir on deploy1001 now. it's there. fyi i had something like this in mind but it's overkill then: https://gerrit.wikimedia.org/r/c/operations/puppet/+/608456/1/modules/profile/manifests/webperf/xhgui.pp [20:45:51] it would have written a real .my.cnf [20:46:57] There's no harm in having it there, but I think anyone debugging it would also be a webperf root and/or code deployer, and hopefully it won't need frequent debugging. :) [20:47:32] (03PS2) 10Ryan Kemper: Temporarily disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) [20:47:36] yes, it only makes sense once you have shell admins without full root [20:48:09] (03CR) 10MSantos: [C: 03+1] Temporarily disable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [20:48:20] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:57:32] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo TTN-0004209267 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:57:32] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo TTN-0004209267 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:05] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T2100). [21:02:38] 10Operations, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Ladsgroup) I just stumbled upon https://www.githubstatus.com/ (github had an outage) and I quite liked the timeline of "green, yellow, red" (green = the whole day... 
[21:07:59] (03CR) 10Ryan Kemper: "https://puppet-compiler.wmflabs.org/compiler1003/23525/" [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [21:15:24] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:17:11] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Remove overrides to welcome survey privacy policy URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608297 (https://phabricator.wikimedia.org/T252572) (owner: 10Kosta Harlan) [21:17:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10wiki_willy) @Jclark-ctr or @Cmjohnson - can one of you doublecheck the s/n's in Netbox? The accounting report says th... [21:22:00] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10ssingh) @KFrancis: Can you please confirm if there is an NDA on file for Guergana as I can't seem to find it in the spreadsheet. @guergana.tzatchkova: Once the NDA... [21:23:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10ayounsi) It's `TA` from the switches CLI. 
[21:23:13] (03CR) 10Muehlenhoff: "Typing this comment in Gerrit 2 UI would have been such a pain :-)" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [21:24:09] (03CR) 10Ryan Kemper: "Guillaume pointed out that we want to disable only for eqiad, so next commit will change the impl. to do that" [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [21:25:27] Hey all - have two sec patches going out to .38 right now. [21:26:59] (03PS3) 10Ryan Kemper: Temporarily disable tilerator in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) [21:28:05] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Legoktm) >>! In T238803#5680344, @CCicalese_WMF wrote: > As noted in the second last bullet, it is desired that we not archive the ext... [21:33:48] (03CR) 10Cwhite: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608102 (https://phabricator.wikimedia.org/T233448) (owner: 10Accraze) [21:34:09] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10CCicalese_WMF) Makes sense. At this point, I think it makes sense to archive EUCopyrightCampaign and EUCopyrightCampaignSkin. [21:39:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10wiki_willy) Cool, thanks @ayounsi. I went ahead and fixed it on the accounting spreadsheet. 
Thanks, Willy [21:55:24] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10thcipriani) 05Resolved→03Open @ssingh I'm not seeing @dancy in the `ciadmin` list. Additionally @dancy reports he's unable to login to logstash. His... [21:56:10] ~log Deployed patch for T255918 [21:56:17] !log Deployed patch for T255918 [21:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10thcipriani) I do see `uid=adancy`; however, @dancy's `uid=dancy`. [22:00:12] (03PS2) 10Andrew Bogott: cloud-vps: puppetize /etc/ldap.conf on sssd clients [puppet] - 10https://gerrit.wikimedia.org/r/608068 [22:00:14] (03PS1) 10Andrew Bogott: codfw1dev ldap: mirror_mode=true [puppet] - 10https://gerrit.wikimedia.org/r/608463 [22:00:45] !log Deployed patch for T256171 [22:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:26] (03PS7) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 [22:01:47] (03CR) 10jerkins-bot: [V: 04-1] planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [22:04:36] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev ldap: mirror_mode=true [puppet] - 10https://gerrit.wikimedia.org/r/608463 (owner: 10Andrew Bogott) [22:12:17] (03PS1) 10Ssingh: admin: update uid for dancy (fixes 58685eac) [puppet] - 10https://gerrit.wikimedia.org/r/608464 (https://phabricator.wikimedia.org/T256658) [22:12:49] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Legoktm) I filed {T256690} and {T256691}. 
[22:17:18] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10CCicalese_WMF) Thank you, @Legoktm! [22:19:37] (03CR) 10Legoktm: [C: 03+2] webservice-python-bootstrap: install wheel [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 (owner: 10BryanDavis) [22:20:13] (03Merged) 10jenkins-bot: webservice-python-bootstrap: install wheel [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 (owner: 10BryanDavis) [22:25:29] (03PS1) 10Legoktm: Commit live changes from tools-docker-imagebuilder-01 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608465 [22:29:36] 10Operations, 10Security: modify-ldap-group should make it impossible to add users who don't exist to a group - https://phabricator.wikimedia.org/T256692 (10Dzahn) [22:33:00] (03CR) 10Dzahn: [C: 03+1] admin: update uid for dancy (fixes 58685eac) [puppet] - 10https://gerrit.wikimedia.org/r/608464 (https://phabricator.wikimedia.org/T256658) (owner: 10Ssingh) [22:33:44] (03CR) 10Ssingh: [C: 03+2] admin: update uid for dancy (fixes 58685eac) [puppet] - 10https://gerrit.wikimedia.org/r/608464 (https://phabricator.wikimedia.org/T256658) (owner: 10Ssingh) [22:33:48] 10Operations, 10Security: modify-ldap-group should make it impossible to add users who don't exist to a group - https://phabricator.wikimedia.org/T256692 (10Legoktm) [22:33:51] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10Legoktm) [22:35:46] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10ssingh) Sorry for the confusion: this should now be resolved. 
[22:38:21] (03CR) 10BryanDavis: [C: 03+2] Add html web image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) (owner: 10Legoktm) [22:38:56] (03Merged) 10jenkins-bot: Add html web image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) (owner: 10Legoktm) [22:40:08] (03CR) 10Dzahn: [C: 04-1] "achievement unlocked: "puppet-lint has encountered an error that it doesn't know how to handle"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [22:40:56] (03PS2) 10BryanDavis: rebuild_all: Reorder builds so that Jessie is built last [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608465 (owner: 10Legoktm) [22:41:37] (03PS3) 10BryanDavis: rebuild_all: Reorder builds so that Jessie is built last [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608465 (owner: 10Legoktm) [22:41:46] (03CR) 10BryanDavis: [C: 03+2] rebuild_all: Reorder builds so that Jessie is built last [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608465 (owner: 10Legoktm) [22:42:28] (03Merged) 10jenkins-bot: rebuild_all: Reorder builds so that Jessie is built last [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608465 (owner: 10Legoktm) [22:42:52] (03CR) 10Dzahn: [C: 03+1] "Paladox, qchris, this should be good to go now. Well, at least once we upgraded devtools to 3.2 as well." [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [22:45:43] (03CR) 10Dzahn: "so you said the letsencrypt_cert method should be removed completely?" 
[puppet] - 10https://gerrit.wikimedia.org/r/607116 (owner: 10Dzahn) [22:46:28] (03PS3) 10BryanDavis: Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) [22:48:53] (03PS8) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 [22:49:33] (03CR) 10Dzahn: [C: 03+2] codesearch: Add port for analytics search profile [puppet] - 10https://gerrit.wikimedia.org/r/608203 (https://phabricator.wikimedia.org/T249318) (owner: 10Legoktm) [22:51:37] (03PS3) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 [22:52:05] (03CR) 10jerkins-bot: [V: 04-1] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn) [22:53:16] thanks mutante [22:53:22] np, legoktm [22:54:30] (03PS4) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 [22:57:05] (03PS5) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200629T2300). 
[23:05:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:05:58] (03CR) 10Dzahn: [C: 03+2] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn) [23:06:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:56] (03CR) 10Dzahn: [C: 03+2] planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [23:15:18] (03CR) 10Dzahn: "I did some tests and first ran puppet without a manual intervention. Result: no errors but nothing happens as user already exists." [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [23:16:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:49] (03CR) 10Dzahn: "Manually running the refresh command:" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [23:27:00] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10KFrancis) @ssingh It doesn't look like we have an NDA on file for Guergana. No problem. I can process once when I have the following info: -Full legal name -Ma... 
[23:29:33] (03PS1) 10Dzahn: systemd::sysuser: quote the gecos field to avoid errors [puppet] - 10https://gerrit.wikimedia.org/r/608489 [23:30:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:32:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:35:15] (03PS1) 10CDanis: add playbook links for important alerts [puppet] - 10https://gerrit.wikimedia.org/r/608490 [23:37:58] (03CR) 10Dzahn: "This also influences all hosts because it is used in base:" [puppet] - 10https://gerrit.wikimedia.org/r/608489 (owner: 10Dzahn) [23:40:56] (03PS1) 10Dzahn: planet: avoid whitespace in gecos field when using systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/608491 [23:44:50] (03CR) 10Dzahn: [C: 03+2] planet: avoid whitespace in gecos field when using systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/608491 (owner: 10Dzahn) [23:49:23] (03CR) 10Dzahn: "as long as there is no space in the gecos field things work:" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [23:54:10] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10Dzahn) confirming that "dancy" is in wmf, releng and ciadmin