[02:59:50] !log Testing twitter integration after software update for Stashbot. In theory messages up to 280 characters in length will now be passed through to the @wikimediatech Twitter feed without being truncated. This message should end with a unicorn face if that is correct. 🦄 [02:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [03:06:09] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [03:20:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [03:27:59] PROBLEM - configured eth on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [03:28:25] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [03:28:49] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:28:51] PROBLEM - Check size of conntrack table on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [03:29:03] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [03:29:15] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [03:29:19] PROBLEM - Disk space on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [03:31:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [03:31:35] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [03:32:27] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:35:47] PROBLEM - SSH on analytics-tool1001 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:53:01] !log reboot analytics-tool1001 [03:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:05] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [03:56:09] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default 
policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [03:56:15] RECOVERY - Disk space on analytics-tool1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [03:56:27] RECOVERY - SSH on analytics-tool1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:56:29] RECOVERY - configured eth on analytics-tool1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [03:56:57] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [03:57:19] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:57:23] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [04:00:19] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:02:11] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is OK: OK: synced at Mon 2019-09-09 04:02:09 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [04:19:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [04:23:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [05:14:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [05:19:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [05:39:28] ^ there is a task for those alerts already [06:08:50] vgutierrez: thanks for analytics-tool1001 :) [06:09:08] np [06:32:38] (03PS9) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [06:35:15] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10elukey) I agree with Andrew, the issue seems to be a violation of an assert or similar in the Varnish libs, so unlikely related to a Varnishkafka bug (famous last words).... 
[06:48:55] 10Operations, 10ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T232267 (10Marostegui) 05Open→03Declined This host is ready to be decommissioned by DC-Ops (T231407), so no need to replace any of these disks [06:49:11] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) [07:01:20] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10elukey) I am personally aware of the risk but when I reviewed it with @Ottomata we thought it was a... [07:33:41] !log installing ghostscript security updates [07:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:59] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:47:16] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (10ema) 05Open→03Resolved cp1075 has been serving live production traffic for several days now, we can consider the test successful. 
[07:47:20] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [07:48:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [07:51:17] (03CR) 10Nikerabbit: Add Draft and Draft_talk aliases for wikis that define draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [07:55:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [07:59:41] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:10:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:11:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:12:51] 10Operations, 10Traffic: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (10Vgutierrez) [08:18:55] (03PS1) 10Vgutierrez: ATS: Disable hardening for the TLS instance on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/535121 (https://phabricator.wikimedia.org/T232298) [08:20:15] (03PS5) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [08:20:22] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531235 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:20:33] (03PS3) 10Jbond: wmcs::openstack::codfw1dev: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531235 (https://phabricator.wikimedia.org/T102099) [08:21:43] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18210/" [puppet] - 10https://gerrit.wikimedia.org/r/535121 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [08:25:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:26:45] 10Operations, 10Traffic, 10Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (10Vgutierrez) Both crashes seems to be related: `name=crash-2019-09-05-155443.log Thread 12661, [ET_NET 32]: 0 0x000000000049d9f0 crash_logger_invoke(int, siginfo_t*... [08:27:28] (03PS2) 10Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [08:29:03] (03CR) 10Jbond: [C: 03+2] ganeti: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531227 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:29:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:29:11] (03PS2) 10Jbond: ganeti: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531227 (https://phabricator.wikimedia.org/T102099) [08:30:28] (03CR) 10Ema: [C: 03+1] ATS: Disable hardening for the TLS instance on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/535121 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [08:33:32] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable hardening for the TLS instance on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/535121 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [08:33:42] (03PS2) 10Vgutierrez: ATS: Disable hardening for the TLS instance on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/535121 (https://phabricator.wikimedia.org/T232298) [08:33:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:37:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [08:38:43] !log disabling systemd hardening for ats-tls on cp5001 - T232298 [08:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:47] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [08:39:01] (03CR) 10Jbond: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/533911 (owner: 10Muehlenhoff) [08:42:37] RECOVERY - traffic_server tls process restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls [08:45:25] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [08:46:23] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [08:47:53] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/18211/" [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:02:22] !log restart archiva on archiva1001 - stuck and not serving requests (no trace about why in the logs) [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:35] (03CR) 10Vgutierrez: "I'm slightly worried about leaving the :7443 port on the lvs instances unmonitored, or is a hack in place similar to modules/lvs/manifests" [puppet] - 
10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:08:37] 10Operations, 10Traffic, 10Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (10Vgutierrez) service log shows memory issues: ` Sep 05 15:54:43 cp5001 traffic_server[12607]: Fatal: couldn't allocate 32768 bytes Sep 08 03:24:10 cp5001 traffic_serve... [09:08:47] 10Operations, 10Traffic, 10Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (10Vgutierrez) p:05Triage→03Normal [09:15:09] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10jbond) p:05Triage→03Normal [09:15:33] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10jbond) p:05Triage→03Normal [09:16:01] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10jbond) p:05Triage→03Normal [09:16:51] 10Operations, 10MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10jbond) p:05Triage→03Normal [09:16:53] (03PS1) 10Muehlenhoff: Adapt Cumin alias for stat to new role name [puppet] - 10https://gerrit.wikimedia.org/r/535127 [09:17:24] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10jbond) p:05Triage→03Normal [09:17:29] (03CR) 10Effie Mouzeli: [C: 03+2] Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [09:18:04] (03CR) 10Elukey: [C: 03+1] "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/535127 (owner: 10Muehlenhoff) [09:19:49] (03CR) 
10Vgutierrez: [C: 03+1] "After some IRC discussion with ema, we agreed on push this forward and aim for HTTPS only services in the "near" future to get rid of the " [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:20:14] (03PS6) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [09:20:53] (03CR) 10Muehlenhoff: [C: 03+2] Adapt Cumin alias for stat to new role name [puppet] - 10https://gerrit.wikimedia.org/r/535127 (owner: 10Muehlenhoff) [09:21:31] (03PS7) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [09:22:18] (03CR) 10Ema: [C: 03+2] lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:22:30] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10jbond) @Nuria are you able to approve @Urbanecm access to `researchers` and `analytics-privatedata-users` [09:28:30] !log lvs1016, lvs2006 (secondaries): restart pybal to add service restbase-ssl T210411 [09:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:33] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [09:29:45] 10Operations, 10media-storage: Server side upload failed with "overwriting failed (at recordUpload stage)" - https://phabricator.wikimedia.org/T231738 (10jbond) @Urbanecm creating this un-triaged task with the #operations tag should be enough to bring it to the attention of the clinic duty, this task must have... 
[09:30:27] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.17:7443]) https://wikitech.wikimedia.org/wiki/PyBal [09:30:30] 10Operations, 10MediaWiki-Maintenance-scripts, 10media-storage: Server side upload failed with "overwriting failed (at recordUpload stage)" - https://phabricator.wikimedia.org/T231738 (10jbond) p:05Triage→03Normal [09:30:55] the pybal lvs1016 alert should clear soon ^ [09:31:02] thx [09:35:45] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.17:7443]) https://wikitech.wikimedia.org/wiki/PyBal [09:35:55] mmh [09:37:33] (03PS1) 10Elukey: profile::cache::kafka::varnishkafka_delivery_alerts: fix description [puppet] - 10https://gerrit.wikimedia.org/r/535130 [09:37:54] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: LDF service does not Vary responses by Content-Type, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10jbond) p:05Triage→03Normal [09:37:59] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:38:29] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:38:49] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:39:27] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:09] (03CR) 10Elukey: [C: 03+2] profile::cache::kafka::varnishkafka_delivery_alerts: fix description [puppet] - 10https://gerrit.wikimedia.org/r/535130 (owner: 10Elukey) [09:40:45] !log updated buster netinst image to 10.1 T232310 [09:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:47] T232310: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 [09:43:53] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10jbond) p:05Triage→03Normal [09:46:53] RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:48:33] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.55 port 9042 https://phabricator.wikimedia.org/T93886 [09:48:58] !log updated stretch netinst image to 9.11 T232308 [09:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:01] T232308: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 [09:53:53] (03PS1) 10Effie Mouzeli: 33.3% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535134 (https://phabricator.wikimedia.org/T219150) [09:53:56] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [09:59:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less 
than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [10:03:30] RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2020-06-24 13:01:53 +0000 (expires in 289 days) https://phabricator.wikimedia.org/T120662 [10:05:48] 10Operations, 10Traffic: PyBal ProxyFetch checks using HTTP/1.0 with https and HTTP/1.1 with plain http - https://phabricator.wikimedia.org/T232319 (10ema) [10:06:42] 10Operations, 10Traffic: PyBal ProxyFetch checks using HTTP/1.0 with https and HTTP/1.1 with plain http - https://phabricator.wikimedia.org/T232319 (10ema) p:05Triage→03Normal [10:07:23] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [10:08:08] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [10:09:14] (03PS1) 10Vgutierrez: Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) [10:12:32] (03Abandoned) 10Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [10:13:22] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [10:13:26] bd808: nice re microblogging @wikimediatech :) [10:13:54] (03CR) 10Effie Mouzeli: [C: 03+2] 33.3% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535134 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [10:14:49] (03Merged) 10jenkins-bot: 33.3% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535134 
(https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [10:16:30] (03CR) 10jenkins-bot: 33.3% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535134 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [10:21:13] 10Operations, 10Analytics, 10Analytics-Cluster: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10jbond) I tried running the followin command on the server however the Current Cache policy remains as `WriteThrough` ` analytics1045 ~ % sud... [10:21:32] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [10:21:36] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) p:05Triage→03Normal [10:21:42] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10jbond) p:05Triage→03Normal [10:22:38] !log jiji@deploy1001:~$ scap sync-file wmf-config/CommonSettings.php "Push PHP7 traffic to 33.3% - T219150" [10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:45] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [10:24:46] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 56.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:27:18] (03PS2) 10Vgutierrez: Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) [10:28:24] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: 
(C)60 le (W)70 le 97.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:29:06] (03PS1) 10Ema: envoyproxy: accept HTTP/1.0 [puppet] - 10https://gerrit.wikimedia.org/r/535142 (https://phabricator.wikimedia.org/T210411) [10:30:22] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535144 (https://phabricator.wikimedia.org/T128546) [10:30:37] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535144 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:32:27] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [10:32:35] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535145 (https://phabricator.wikimedia.org/T128546) [10:33:12] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535144 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:31] (03CR) 10Volans: "Addressed comment, reply inline" (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [10:33:39] (03PS5) 10Volans: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) [10:33:41] (03PS4) 10Volans: config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) [10:33:43] (03PS5) 10Volans: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) [10:33:45] (03PS2) 10Volans: config: enforce 
positional vs. keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 [10:34:42] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (10jbond) 05Open→03Resolved a:03jbond Hello Greg, I have created the mailing list and you can now access the [[https://lists.wikimedia.org/mailman/admin/engprod|admin]] and [[ht... [10:35:26] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535145 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:19] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535145 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:38] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535145 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:42] (03PS3) 10Vgutierrez: Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) [10:38:13] 10Operations, 10Wikimedia-Mailing-lists: Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10jbond) 05Open→03Resolved a:03jbond Hello Jean-Rene, I have created the mailing list and you can now access the [[https://lists.wikimedia.org/mailman/admin/testeng|adm... 
[10:38:56] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:535145| Bumping portals to master (T128546)]] (duration: 00m 54s) [10:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:59] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:38:59] (03PS1) 10Muehlenhoff: Add DHCP config for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/535146 (https://phabricator.wikimedia.org/T231015) [10:39:34] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10jbond) p:05Triage→03Normal [10:39:50] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:535145| Bumping portals to master (T128546)]] (duration: 00m 53s) [10:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:55] (03CR) 10Volans: "I'm wondering if we should log and phaste at all on failure, if something didn'g get committed what's the benefit of !log-ging and phast-i" [software/conftool] - 10https://gerrit.wikimedia.org/r/534230 (https://phabricator.wikimedia.org/T231871) (owner: 10CDanis) [10:43:53] (03CR) 10Ema: [C: 03+2] "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/18212/restbase1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/535142 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:44:00] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP config for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/535146 (https://phabricator.wikimedia.org/T231015) (owner: 10Muehlenhoff) [10:45:36] (03PS2) 10Ema: envoyproxy: accept HTTP/1.0 [puppet] - 10https://gerrit.wikimedia.org/r/535142 (https://phabricator.wikimedia.org/T210411) [10:48:17] moritzm: ok to puppet-merge your change? 
[10:48:51] yes, sorry [10:48:59] np, done [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:01:00] 10Operations, 10Puppet, 10cloud-services-team: labspuppetmaster1001 puppet-merge failing - https://phabricator.wikimedia.org/T232322 (10jbond) p:05Triage→03Normal [11:11:07] (03PS1) 10Filippo Giunchedi: logstash: stop relaying to central statsd [puppet] - 10https://gerrit.wikimedia.org/r/535148 (https://phabricator.wikimedia.org/T205870) [11:11:09] (03PS1) 10Filippo Giunchedi: prometheus: collect swift account/container stats globally [puppet] - 10https://gerrit.wikimedia.org/r/535149 (https://phabricator.wikimedia.org/T205870) [11:12:10] (03Abandoned) 10Vgutierrez: Add SPF record for wikimedia.ee [dns] - 10https://gerrit.wikimedia.org/r/504242 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [11:12:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [11:14:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: collect swift account/container stats globally [puppet] - 10https://gerrit.wikimedia.org/r/535149 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [11:14:13] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [11:14:14] (03PS2) 10Filippo Giunchedi: prometheus: collect swift account/container stats globally
[puppet] - 10https://gerrit.wikimedia.org/r/535149 (https://phabricator.wikimedia.org/T205870) [11:16:21] (03PS1) 10Muehlenhoff: On Ganeti servers print the current master node in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/535150 [11:19:18] Since there are no patches scheduled for SWAT, could a deployer please run a maintscript? [11:19:43] (Amir1, Lucas_WMDE, awight, Urbanecm) [11:19:54] Daimona: certainly [11:19:56] how may I help? [11:19:58] Yay [11:20:07] https://phabricator.wikimedia.org/T231137 [11:20:10] For group2 :) [11:20:19] will do soon! [11:20:29] Cool, ty [11:26:02] !log installing ldap-corp2001 T231015 [11:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:07] T231015: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 [11:27:45] Daimona: I'm back [11:27:54] Nice [11:28:17] I'm here for another 5 minutes, then I'll be AFK for a while [11:28:33] But it's pretty likely that there'll be 0 results for g2 [11:28:54] Daimona: running for all wikis [11:30:03] (03PS1) 10Jbond: netbox: add python3-pynetbox [puppet] - 10https://gerrit.wikimedia.org/r/535151 [11:31:49] Please put it on the task, I'll be back in 10 minutes [11:32:29] Will do Daimona [11:32:43] !log Dry run for all wikis (T231137) [11:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] T231137: Run fixFirstBlockAutopromoteEntries in production - https://phabricator.wikimedia.org/T231137 [11:36:37] (03PS1) 10Muehlenhoff: Drop apt-secure workaround [puppet] - 10https://gerrit.wikimedia.org/r/535152 (https://phabricator.wikimedia.org/T232310) [11:39:12] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:44] I'm back [11:40:46] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:41:03] Daimona: done [11:41:24] Alright, two wikis only [11:41:29] I'm gonna check 'em [11:42:12] let me know § [11:42:32] Sure, it'll take 5ish minutes [11:42:51] sure [11:43:22] (03CR) 10Muehlenhoff: [C: 03+2] Drop apt-secure workaround [puppet] - 10https://gerrit.wikimedia.org/r/535152 (https://phabricator.wikimedia.org/T232310) (owner: 10Muehlenhoff) [11:44:19] fawiki is fine [11:46:02] ok [11:46:03] idwiki too [11:46:12] Two entries are in a different language but that's fine [11:47:05] Green light [11:47:14] thanks Daimona [11:47:15] * cormacparle__ waves [11:47:16] * Urbanecm waves to cormacparle__ [11:47:42] +2'ed your patch cormacparle__ [11:47:52] kk [11:49:47] just realised it depends on another patch that's failing CI [11:50:13] hoping a recheck might do it, failures look CI-related rather than anything to do with the code [11:50:23] cormacparle__: ok, let me know [11:50:55] cormacparle__: patch pulled on mwdebug1002 [11:51:00] let me know if it works [11:51:37] Daimona: still around to do final fawiki test? [11:51:48] Urbanecm: sure [11:51:54] Ready for the actual run [11:52:42] Daimona: fawiki https://phabricator.wikimedia.org/P9061 [11:52:59] Looks great [11:53:24] Daimona: https://phabricator.wikimedia.org/P9062 [11:53:35] (03PS1) 10Mathew.onipe: elasticsearch: add dependencies for JsonLayout [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) [11:53:47] Looks great as well [11:54:02] Calling resolved then, thanks!
[11:54:40] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add dependencies for JsonLayout [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [11:57:04] (03PS2) 10Mathew.onipe: elasticsearch: add dependencies for JsonLayout [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) [11:57:05] (03PS1) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535158 (https://phabricator.wikimedia.org/T225125) [11:59:27] (03CR) 10Gehel: elasticsearch: add dependencies for JsonLayout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:00:15] (03PS3) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 [12:00:17] (03PS3) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 [12:00:53] (03CR) 10jerkins-bot: [V: 04-1] ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez) [12:01:00] !log installing ldap-corp1001 T231015 [12:01:09] (03CR) 10jerkins-bot: [V: 04-1] ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 (owner: 10Vgutierrez) [12:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:13] T231015: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 [12:02:50] (03PS3) 10Mathew.onipe: elasticsearch: add dependencies for JsonLayout [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) [12:02:52] (03PS2) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535158 
(https://phabricator.wikimedia.org/T225125) [12:04:42] (03CR) 10Mathew.onipe: elasticsearch: add dependencies for JsonLayout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535157 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:09:34] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) The Debian Stretch point release with the OpenSSH backport was released last weekend... [12:11:24] !log Undeployed patch in wmf branch, will resolve soon [12:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:50] (03PS4) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 [12:12:52] (03PS4) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 [12:17:42] (03CR) 10Vgutierrez: "pcc is still happy: https://puppet-compiler.wmflabs.org/compiler1001/18213/" [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez) [12:23:12] Urbanecm: that dependency has finally passed CI [12:23:13] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/534853/ [12:23:23] will need that for the mediainfo patch to work [12:23:55] (03CR) 10Ema: [C: 03+1] ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez) [12:24:23] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=restbase-ssl,name=restbase1022.eqiad.wmnet [12:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:51] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:24:51] !log ema@puppetmaster1001 
conftool action : set/pooled=yes; selector: service=restbase-ssl,name=restbase2009.codfw.wmnet [12:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:37] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:29:44] !log restart archiva again to debug download artifact issue [12:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:16] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=restbase-ssl,dc=codfw [12:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:42] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=restbase-ssl,dc=eqiad [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:51] (03PS4) 10Vgutierrez: Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) [12:34:59] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez) [12:36:24] !log lvs2003 (primary): restart pybal to add service restbase-ssl T210411 [12:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:26] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [12:40:31] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:41:21] !log lvs1015 (primary): restart pybal to add service restbase-ssl T210411 [12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:56] cormacparle__: merged [12:44:05] (and pulled onto mwdebug1002) [12:44:18] !log EU SWAT wmf patch ongoing, testing with mwdebug1002 [12:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] On Ganeti servers print the current master node in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/535150 (owner: 10Muehlenhoff) [12:47:38] (03PS1) 10Ema: ATS: use TLS with RESTbase [puppet] - 10https://gerrit.wikimedia.org/r/535178 (https://phabricator.wikimedia.org/T210411) [12:48:09] !log upgrading remaining job runners to PHP 7.2.22 T230024 [12:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:12] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [12:49:07] Urbanecm: seems to be working fine [12:49:23] cormacparle__: okay, will sync. Wikibase patch first. Are they testable separately? [12:49:31] no [12:49:34] Okay [12:50:53] (03CR) 10Vgutierrez: [C: 03+1] ATS: use TLS with RESTbase [puppet] - 10https://gerrit.wikimedia.org/r/535178 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:51:31] Is that current status in topic still correct? 
[12:51:33] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/Wikibase: ubn patch T231276 (duration: 01m 03s) [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:35] T231276: RevisionBasedEntityLookup.php: Revision 363395998 belongs to M77688146 instead of expected M81625979 - https://phabricator.wikimedia.org/T231276 [12:51:40] (03CR) 10Ema: [C: 03+2] ATS: use TLS with RESTbase [puppet] - 10https://gerrit.wikimedia.org/r/535178 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:51:47] Zppix: yes, still ongoing [12:51:56] Ok, wasnt sure thanks [12:51:58] <_joe_> not that I am aware of Urbanecm [12:53:08] general EU issues are still ongoing, until there's a new message on wmf blog [12:53:14] *general connectivity [12:53:30] https://wikimediafoundation.org/news/2019/09/07/malicious-attack-on-wikipedia-what-we-know-and-what-were-doing/ [12:54:37] cormacparle__: syncing [12:55:19] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/WikibaseMediaInfo/: ubn patch T231276 (duration: 00m 58s) [12:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] T231276: RevisionBasedEntityLookup.php: Revision 363395998 belongs to M77688146 instead of expected M81625979 - https://phabricator.wikimedia.org/T231276 [12:55:34] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [12:55:55] cormacparle__: please test [12:56:41] (03PS5) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 (https://phabricator.wikimedia.org/T231433) [12:57:44] (03CR) 10Ema: [C: 03+1] ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [12:59:11] (03PS1) 10Filippo Giunchedi: grafana: use 
Prometheus swift metrics for dashboard [puppet] - 10https://gerrit.wikimedia.org/r/535180 (https://phabricator.wikimedia.org/T205870) [12:59:50] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:00:30] Urbanecm: all seems good, thank you! [13:01:39] !log upgrading remaining mediawiki app servers (mw1266-mw1275) to PHP 7.2.22 T230024 [13:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:45] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [13:01:51] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [13:02:04] (03PS6) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 (https://phabricator.wikimedia.org/T231433) [13:02:13] !log Patch is deployed, deploy1001 should be clear [13:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:52] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:09:54] (03PS1) 10Filippo Giunchedi: swift: port object diff alerts to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/535182 (https://phabricator.wikimedia.org/T205870) [13:09:57] (03CR) 10Ema: [C: 03+1] Release 8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [13:15:33] !log upgrading labweb/wikitech to PHP 7.2.22 T230024 [13:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:36] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [13:15:51] (03CR) 10Vgutierrez: [C: 03+2] Release 
8.0.5-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/535139 (https://phabricator.wikimedia.org/T232298) (owner: 10Vgutierrez) [13:18:43] (03CR) 10CDanis: "> Patch Set 2:" [software/conftool] - 10https://gerrit.wikimedia.org/r/534230 (https://phabricator.wikimedia.org/T231871) (owner: 10CDanis) [13:18:51] (03Abandoned) 10CDanis: depool esams [dns] - 10https://gerrit.wikimedia.org/r/534988 (owner: 10CDanis) [13:23:45] (03PS1) 10Vgutierrez: acme_chief: Remove gerrit-slave.wm.o from the gerrit certificate SNI list [puppet] - 10https://gerrit.wikimedia.org/r/535184 (https://phabricator.wikimedia.org/T229822) [13:27:01] (03CR) 10Paladox: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/535184 (https://phabricator.wikimedia.org/T229822) (owner: 10Vgutierrez) [13:29:12] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Remove gerrit-slave.wm.o from the gerrit certificate SNI list [puppet] - 10https://gerrit.wikimedia.org/r/535184 (https://phabricator.wikimedia.org/T229822) (owner: 10Vgutierrez) [13:31:09] !log installing facter update from buster 10.1 point release (T222356) [13:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:13] T222356: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 [13:31:27] (03PS2) 10Filippo Giunchedi: swift: port alerts to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/535182 (https://phabricator.wikimedia.org/T205870) [13:38:18] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) 05Open→03Resolved a:03Ottomata [13:38:36] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Jclark-ctr) host row unit port cloudcephosd1001 b7 27 39/25 cloudcephosd1002 b4 12 43/42 cloudcephosd...
[13:39:49] !log uploaded trafficserver 8.0.5-1wm6 to apt.wikimedia.org (stretch) - T232298 [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:52] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [13:46:33] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:03] 10Operations, 10Puppet, 10cloud-services-team: labspuppetmaster1001 puppet-merge failing - https://phabricator.wikimedia.org/T232322 (10JHedden) 05Open→03Resolved a:03JHedden Fixed. The changes for commit 2e5424b4010c29da463eaf3c4ca2898c0a8fb79d were applied, but for some odd reason not the commit? I'v... [13:51:00] !log upgrading ats to 8.0.5-1wm6 on cp5001 - T232298 [13:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:04] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [13:51:18] 10Operations, 10Accuracy-Review-of-Wikipedias, 10Bad-Words-Detection-System, 10Better Use Of Data, and 88 others: Deprecate jquery.throttle-debounce in favour of OO.ui.debounce/throttle - https://phabricator.wikimedia.org/T213426 (10GoogleLegacy) [13:51:23] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10MoritzMuehlenhoff) The fix landed in Buster 10.1 and I've rolled it out to our Buster hosts. I think we can close this task.
[13:59:48] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add a new command to combine a deployment with a library restart check [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/533911 (owner: 10Muehlenhoff) [14:02:31] 10Operations, 10Wikimedia-Incident: September 2019 DoS attack [Public] - https://phabricator.wikimedia.org/T232224 (10CDanis) [14:03:28] (03PS2) 10Ottomata: analytics::refinery::job::data_purge.pp Add skip-trash to timers [puppet] - 10https://gerrit.wikimedia.org/r/533955 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [14:03:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] analytics::refinery::job::data_purge.pp Add skip-trash to timers [puppet] - 10https://gerrit.wikimedia.org/r/533955 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [14:04:26] Urbanecm: Thanks for deploying the MediaInfo UBN, now I don’t have to. :-) [14:04:56] James_F: happy to help, as always ;) [14:09:33] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10MoritzMuehlenhoff) 05Open→03Resolved [14:09:38] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) [14:19:54] (03PS4) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) [14:19:56] (03PS3) 10CDanis: dbctl: add set-note instance subcommand [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) [14:20:00] (03CR) 10CDanis: dbctl: add set-candidate-master subcommand on instance (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:20:02] (03PS1) 10Ema: etherpad: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/535194 (https://phabricator.wikimedia.org/T210411)
[14:20:07] (03CR) 10CDanis: dbctl: add set-note instance subcommand (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:21:44] (03PS1) 10Ema: secret: dummy key for etherpad [labs/private] - 10https://gerrit.wikimedia.org/r/535195 (https://phabricator.wikimedia.org/T210411) [14:22:08] !log bootstrapping Cassandra, restbase-dev1004-a -- T224554 [14:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [14:22:25] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:36] (03CR) 10jerkins-bot: [V: 04-1] dbctl: add set-note instance subcommand [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:23:09] (03CR) 10jerkins-bot: [V: 04-1] dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:26:48] (03PS1) 10Ema: etherpad: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/535201 (https://phabricator.wikimedia.org/T210411) [14:28:17] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for etherpad [labs/private] - 10https://gerrit.wikimedia.org/r/535195 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:28:28] (03CR) 10Volans: [C: 03+1] "Code looks good (modulo flake8 nit ;) ). 
Replies inline" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:30:57] (03PS2) 10Ema: etherpad: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/535201 (https://phabricator.wikimedia.org/T210411) [14:32:00] (03CR) 10Marostegui: dbctl: add set-candidate-master subcommand on instance (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:34:01] 10Operations, 10Mail: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10MoritzMuehlenhoff) [14:35:05] (03PS1) 10Marostegui: mariadb: Decommission db2047 [puppet] - 10https://gerrit.wikimedia.org/r/535203 (https://phabricator.wikimedia.org/T231852) [14:35:17] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:43] 10Operations, 10SDC General, 10Wikidata, 10Discovery-Search (Current work): Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Mathew.onipe) [14:41:33] (03CR) 10Ema: [C: 03+2] etherpad: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/535194 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:42:04] (03CR) 10Ema: [C: 03+2] "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/18215/" [puppet] - 10https://gerrit.wikimedia.org/r/535201 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:42:49] 10Operations, 10SDC General, 10Wikidata, 10Discovery-Search (Current work): Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Mathew.onipe) [14:43:06] (03CR) 10Ayounsi: transports: add JunOS transport (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [14:45:12] 
10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [14:46:37] 10Operations, 10SDC General, 10Wikidata, 10Discovery-Search (Current work): Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Mathew.onipe) [14:49:19] 10Operations, 10SDC General, 10Wikidata, 10Discovery-Search (Current work): Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Mathew.onipe) [14:52:50] (03PS1) 10Ema: etherpad: set TLS port to 7443 [puppet] - 10https://gerrit.wikimedia.org/r/535204 (https://phabricator.wikimedia.org/T210411) [14:53:32] (03PS18) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [14:54:29] (03CR) 10CRusnov: [C: 03+1] "thanks, there are several details to work out stil,l obv. :)" [puppet] - 10https://gerrit.wikimedia.org/r/535151 (owner: 10Jbond) [14:54:40] (03CR) 10Ema: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18216/" [puppet] - 10https://gerrit.wikimedia.org/r/535204 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:54:57] (03CR) 10Ayounsi: "> Patch Set 1:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [14:55:18] (03PS3) 10CRusnov: netbox::postgres; fix username for netbox connection [puppet] - 10https://gerrit.wikimedia.org/r/534856 [14:56:08] (03CR) 10Holger Knust: "Thanks for the review Volans! Uploaded new patch set. 
I deferred some of the comments to a future release" (0312 comments) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [14:56:24] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/535151 (owner: 10Jbond) [14:56:35] (03PS2) 10Jbond: netbox: add python3-pynetbox [puppet] - 10https://gerrit.wikimedia.org/r/535151 [14:56:38] (03PS1) 10Ottomata: Disable mysql-eventbus eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/535205 (https://phabricator.wikimedia.org/T232349) [14:57:23] (03CR) 10Jbond: [C: 03+2] netbox: add python3-pynetbox [puppet] - 10https://gerrit.wikimedia.org/r/535151 (owner: 10Jbond) [14:59:03] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10Papaul) [14:59:06] (03PS4) 10CRusnov: netbox::postgres; fix username for netbox connection [puppet] - 10https://gerrit.wikimedia.org/r/534856 [14:59:31] PROBLEM - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:22] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) 05Open→03Resolved complete [15:00:25] 10Operations, 10ops-eqiad, 10decommission, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Papaul) [15:00:37] the etherpad1001 critical is my fault [15:00:46] (03PS6) 10Volans: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) [15:00:47] (03PS5) 10Volans: config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) [15:00:49] (03PS6) 10Volans: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) [15:00:51] (03PS3) 10Volans: config: enforce positional vs.
keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 [15:00:53] (03CR) 10Volans: "Addressed comment" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:01:07] RECOVERY - Check systemd state on etherpad1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:28] (03CR) 10CRusnov: [C: 03+2] netbox::postgres; fix username for netbox connection [puppet] - 10https://gerrit.wikimedia.org/r/534856 (owner: 10CRusnov) [15:05:56] (03CR) 10Volans: "quick reply inline, I didn't look at latest PS18" (031 comment) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [15:10:52] (03PS1) 10Ottomata: Prep for installing an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/535209 (https://phabricator.wikimedia.org/T225128) [15:11:25] PROBLEM - Host mw2231.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:15] (03CR) 10Thcipriani: [C: 03+1] gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [15:16:15] (03CR) 10Paladox: [C: 03+1] "@Dzahn yup! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [15:17:05] RECOVERY - Host mw2231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [15:20:24] 10Operations, 10Security-Team: Remove Michal Anna Marble from security@ alias in exim - https://phabricator.wikimedia.org/T232352 (10sbassett) [15:20:31] (03PS1) 10Papaul: DNS: Remove DNS mgmt asset tag WMF6403 [dns] - 10https://gerrit.wikimedia.org/r/535211 (https://phabricator.wikimedia.org/T200210) [15:20:35] (03PS1) 10CRusnov: profile::netbox: Fix https ferm rule. 
[puppet] - 10https://gerrit.wikimedia.org/r/535210 [15:20:37] 10Operations, 10Security-Team: Remove Michal Anna Marble from security@ alias in exim - https://phabricator.wikimedia.org/T232352 (10sbassett) p:05Triage→03Normal [15:21:17] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Fix https ferm rule. [puppet] - 10https://gerrit.wikimedia.org/r/535210 (owner: 10CRusnov) [15:22:57] (03PS2) 10CRusnov: profile::netbox: Fix https ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/535210 [15:25:21] (03PS1) 10Papaul: DNS: Change asset tag DNS for mw2231 [dns] - 10https://gerrit.wikimedia.org/r/535212 (https://phabricator.wikimedia.org/T231192) [15:25:50] (03CR) 10jerkins-bot: [V: 04-1] DNS: Change asset tag DNS for mw2231 [dns] - 10https://gerrit.wikimedia.org/r/535212 (https://phabricator.wikimedia.org/T231192) (owner: 10Papaul) [15:25:59] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Fix https ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/535210 (owner: 10CRusnov) [15:32:01] (03CR) 10Ayounsi: [C: 03+1] transports: add JunOS transport (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:33:10] (03CR) 10Ayounsi: [C: 03+2] transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:35:02] (03PS1) 10Papaul: DHCP: Change MAC address for mw2231 [puppet] - 10https://gerrit.wikimedia.org/r/535214 [15:36:00] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Change MAC address for mw2231 [puppet] - 10https://gerrit.wikimedia.org/r/535214 (owner: 10Papaul) [15:38:51] (03PS2) 10Papaul: DHCP: Change MAC address for mw2231 [puppet] - 10https://gerrit.wikimedia.org/r/535214 (https://phabricator.wikimedia.org/T231192) [15:40:18] (03Merged) 10jenkins-bot: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner:
10Volans) [15:41:08] (03Merged) 10jenkins-bot: config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:41:10] (03Merged) 10jenkins-bot: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:46:27] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:46:29] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:47:03] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:47:13] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:41] PROBLEM - configured eth on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:47:46] (03CR) 10jenkins-bot: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:47:51] the oom killer [15:47:53] PROBLEM - Disk space on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [15:47:53] PROBLEM - Check size of conntrack table on 
analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:48:03] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:48:05] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:48:14] (03CR) 10Muehlenhoff: [C: 03+2] DHCP: Change MAC address for mw2231 [puppet] - 10https://gerrit.wikimedia.org/r/535214 (https://phabricator.wikimedia.org/T231192) (owner: 10Papaul) [15:48:32] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Ramsey-WMF) [15:48:37] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:48:47] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:17] RECOVERY - configured eth on analytics-tool1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:49:27] RECOVERY - Disk space on analytics-tool1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [15:49:28] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:49:40] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10Cmjohnson) 
a:05Cmjohnson→03Jclark-ctr this got lost in the shuffle....will work on it this week . @Jclark-ctr can you contact HPE support and open a ticket please. [15:50:21] (03PS19) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [15:51:25] (03CR) 10Holger Knust: table-properties: Initial commit (031 comment) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [15:57:01] (03PS1) 10Ottomata: Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) [15:57:15] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10wiki_willy) a:03Cmjohnson [15:57:22] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) a:03Cmjohnson [15:57:23] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [15:58:26] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) @Cmjohnson / @Jclark-ctr https://gerrit.wikimedia.org/r/535221 adds DNS for non mgmt entries.... [16:00:35] (03CR) 10Bstorm: [C: 03+1] "I think this should be ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/531241 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [16:01:48] (03CR) 10Krinkle: Direct Parsoid/PHP rt-testing log events to a different target (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [16:04:26] (03PS2) 10Subramanya Sastry: Direct Parsoid/PHP rt-testing log events to a different target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) [16:09:03] (03CR) 10jenkins-bot: config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:10:45] (03CR) 10jenkins-bot: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:14:23] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:14:32] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [16:14:49] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [16:16:27] RECOVERY - Host mw2231 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:19:03] PROBLEM - Nginx local proxy to apache on mw2231 is CRITICAL: connect to address 10.192.0.57 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [16:19:31] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) a:05Papaul→03MoritzMuehlenhoff System replacement complete - update Netbox - Switch port re-enable @MoritzMuehlenhoff the system is ready for re-image. [16:19:41] PROBLEM - Check systemd state on mw2231 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:07] PROBLEM - Check whether ferm is active by checking the default input chain on mw2231 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:22:14] (03PS2) 10Ottomata: Disable mysql-eventbus eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/535205 (https://phabricator.wikimedia.org/T232349) [16:22:49] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Disable mysql-eventbus eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/535205 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:23:24] (03CR) 10Elukey: [C: 03+1] Disable mysql-eventbus eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/535205 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:24:16] (03PS1) 10CRusnov: postgres::slave: remove type hint to address debt [puppet] - 10https://gerrit.wikimedia.org/r/535224 [16:27:20] (03PS1) 10Ottomata: Use $ensure for all resources in eventlogging::service::consumer [puppet] - 10https://gerrit.wikimedia.org/r/535225 (https://phabricator.wikimedia.org/T232349) [16:28:42] (03CR) 10Ottomata: [C: 03+2] Use $ensure for all resources in eventlogging::service::consumer [puppet] - 10https://gerrit.wikimedia.org/r/535225 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:28:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks okay as a workaround to unbreak the status quo!" [puppet] - 10https://gerrit.wikimedia.org/r/535224 (owner: 10CRusnov) [16:28:55] (03CR) 10CRusnov: "puppetdb now compiles with this patch (noop https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18217/console). 
Ev" [puppet] - 10https://gerrit.wikimedia.org/r/535224 (owner: 10CRusnov) [16:30:25] (03PS2) 10CRusnov: postgres::slave: remove type hint to address debt [puppet] - 10https://gerrit.wikimedia.org/r/535224 [16:32:03] (03CR) 10Ori.livneh: "I'm abandoning this. I think there are better approaches. I'll follow up on Phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [16:32:24] (03Abandoned) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [16:32:28] PROBLEM - Check systemd state on eventlog1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:43] (03CR) 10CRusnov: [C: 03+2] postgres::slave: remove type hint to address debt [puppet] - 10https://gerrit.wikimedia.org/r/535224 (owner: 10CRusnov) [16:32:55] (03PS1) 10Ottomata: Remove unused eventlogging mysql consumer eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/535226 (https://phabricator.wikimedia.org/T232349) [16:34:51] RECOVERY - Nginx local proxy to apache on mw2231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:35:14] (03CR) 10Ottomata: [C: 03+2] Remove unused eventlogging mysql consumer eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/535226 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:35:23] RECOVERY - Check whether ferm is active by checking the default input chain on mw2231 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:35:25] (03PS2) 10Ottomata: Remove unused eventlogging mysql consumer eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/535226 (https://phabricator.wikimedia.org/T232349) [16:35:27] (03CR) 10Ottomata: [V: 03+2 C: 
03+2] Remove unused eventlogging mysql consumer eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/535226 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:35:33] RECOVERY - Check systemd state on mw2231 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:10] (03CR) 10Nuria: [C: 03+1] "Feng shui!" [puppet] - 10https://gerrit.wikimedia.org/r/535226 (https://phabricator.wikimedia.org/T232349) (owner: 10Ottomata) [16:44:21] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) Per SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date TBA due to a lot of the ongoing work related to the recent outages. [16:44:47] PROBLEM - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:59] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Date TBD) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) [16:50:00] !log replacing Fan kit and power supplies on cr1-codfw [16:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:25] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Groceryheist) I deleted a couple Gb that I don't need. Unfortunately most of the space I'm using is from ORES assets so I can't really store it in Hadoop. Maybe I should move this work... 
[16:54:31] 10Operations, 10ops-eqiad: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10RobH) [16:57:47] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) @Groceryheist Can you explain a bit why ORES assets cannot be stored in hadoop? [17:00:04] gehel and onimisionipe: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T1700). [17:00:45] jouncebot: no deploy today! [17:02:34] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) p:05Triage→03Normal [17:02:51] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:03:06] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:03:40] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:04:11] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) I've set the due date via the require by date on the ordering task: Need By Date Mid Q1, to get enough time to be fully in service before EOQ [17:11:17] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) It looks like we have about 46G available for now, so hopefully that can hold us over. If you don't mind, just keep an eye on usage, and if it gets close to full, delete thing... [17:13:15] PROBLEM - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:14:33] (03CR) 10Bstorm: [C: 03+2] tagging: Add the tag to the templates [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534846 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [17:15:05] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) Thank you @MoritzMuehlenhoff for your efforts getting the backport submitted and accepted upst... [17:20:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) a:05wiki_willy→03Cmjohnson Here's the response I got from Dell (pasted below). @cmjohnson or @Jclark-ctr... [17:22:38] (03PS3) 10Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) [17:26:27] (03PS4) 10Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) [17:44:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:51:33] (03PS7) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [17:52:52] (03PS8) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for 
testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [17:55:33] !log push cloudflare tunnel config to cr1-eqsin [17:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:24] !log disabling puppet on labpuppetmaster1001 as part of T171188 [17:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:27] T171188: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:05:55] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:07:05] (03PS2) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [18:12:32] (03PS9) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [18:12:34] (03PS3) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [18:12:36] (03PS2) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [18:12:38] (03PS1) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 
[18:13:08] (03CR) 10Thcipriani: [C: 03+1] "Looks good from the scap perspective!" (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [18:14:19] (03CR) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:18:24] (03PS3) 10Andrew Bogott: cloud: Switch encapi calls to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530340 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:20:11] MaxSem, RoanKattouw, Niharika, and Urbanecm: would you please be available for a last minute backport? [18:20:30] Daimona: morning or evening swat? [18:20:34] (03CR) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:20:40] * Urbanecm is in an evening train, but if it is urgent, can do [18:20:47] * Urbanecm has about 15 mins [18:20:52] The one right now :D [18:21:34] Sure [18:21:42] It's not so critical [18:21:58] I also don't know how much time I can stay around here [18:27:31] I'm about to arrive, sorry. Ping me tomorrow! [18:28:00] (03PS1) 10Jayprakash12345: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) [18:29:38] (03CR) 10Krinkle: [C: 04-1] Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:30:00] (03CR) 10Jayprakash12345: "@Urbanecm Can you help me to check images?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [18:30:13] (03CR) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:31:04] (03PS4) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [18:31:18] Urbanecm: no problem :) I don't think we need it so urgently [18:44:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:46:07] (03PS1) 10Krinkle: logging: Remove unused 'logstash' formatter since 'cee' adoption [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) [18:46:11] (03PS1) 10Krinkle: logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) [18:47:11] (03CR) 10Andrew Bogott: [C: 03+2] cloud: Switch encapi calls to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530340 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:51:51] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:52:05] woop one sec [18:53:25] (03CR) 10Krinkle: [C: 03+1] Direct Parsoid/PHP rt-testing log events to a different target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) 
(owner: 10Subramanya Sastry) [18:53:51] (03CR) 10Krinkle: [C: 03+1] "Nice! I've excluded this from various dashboards already but would be nice not to have to do that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [18:54:35] (03PS1) 10Ottomata: Remove eventbus from LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535265 (https://phabricator.wikimedia.org/T232122) [18:54:53] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:55:09] (03PS4) 10Andrew Bogott: cloud: Change monitoring things to look at new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530344 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:57:16] (03CR) 10Andrew Bogott: [C: 03+2] cloud: Change monitoring things to look at new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530344 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [19:05:03] (03PS5) 10Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188) [19:05:29] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:05:36] (03CR) 10jerkins-bot: [V: 04-1] Remove eventbus from LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535265 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [19:09:43] (03Abandoned) 10Ottomata: Remove eventbus from LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535265 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [19:14:48] 
(03PS1) 10Ottomata: Decomission eventlogging-service-eventbus in beta / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/535269 (https://phabricator.wikimedia.org/T232122) [19:14:49] (03PS6) 10Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188) [19:14:51] (03PS4) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [19:15:28] (03CR) 10Ottomata: [C: 03+2] "no-op in prod: https://puppet-compiler.wmflabs.org/compiler1001/18218/kafka-main1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/535269 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [19:15:59] (03CR) 10Andrew Bogott: [C: 03+2] cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [19:18:13] hello! thanks for all the work you did during the weekend. [19:18:47] (03PS2) 10Ottomata: Decomission eventlogging-service-eventbus in beta / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/535269 (https://phabricator.wikimedia.org/T232122) [19:18:52] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Decomission eventlogging-service-eventbus in beta / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/535269 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [19:19:07] I don't know if https://phabricator.wikimedia.org/T232362 is related, but I'd welcome your opinion on this. This effort is kinda crucial to CE, so your help would be truly appreciated. TYVM! 
[19:19:49] Elitre: I'd suspect it's completely unrelated [19:20:41] 10Operations, 10Security-Team: Remove Michal Anna Marble from security@ alias in exim - https://phabricator.wikimedia.org/T232352 (10CDanis) 05Open→03Resolved a:03CDanis [19:21:18] thanks Reedy . Don't know if this is bad or good news, but still :) [19:22:21] 10Operations, 10MassMessage, 10WMF-JobQueue: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Elitre) [19:30:32] 10Operations, 10MassMessage, 10WMF-JobQueue: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Aklapper) [19:37:09] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@533d541]: Update mobileapps to 01971d9 [19:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:14] !log fix eqsin CF tunnel missconfig [19:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:25] RECOVERY - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:40:09] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@533d541]: Update mobileapps to 01971d9 (duration: 02m 59s) [19:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:51] !log mobileapps deployment failed repooling canary (scb2001); retrying [19:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:17] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@533d541]: Update mobileapps to 01971d9 [19:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:55] (03PS5) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [19:44:57] (03PS1) 10Andrew Bogott: cloud recursors: alias 'puppet' to the cloud-internal puppetmaster IP [puppet] - 10https://gerrit.wikimedia.org/r/535275 (https://phabricator.wikimedia.org/T171188) [19:46:22] (03CR) 10Andrew Bogott: [C: 03+2] cloud recursors: alias 'puppet' to the cloud-internal puppetmaster IP [puppet] - 10https://gerrit.wikimedia.org/r/535275 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [19:48:02] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@533d541]: Update mobileapps to 01971d9 (duration: 05m 45s) [19:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) ping @ema now that ahem, things are a bit more quiet [19:59:40] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10AAlikhan) Hi - Just checking in on the above. Thanks. 
[20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T2000). [20:09:50] 10Operations, 10MassMessage, 10WMF-JobQueue: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Pchelolo) I believe it's a new issue related to a switch from EventBus service to eventgate. See https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.0... [20:14:54] 10Operations, 10MassMessage, 10WMF-JobQueue: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Rmaung) Is there a way to know whether those messages will ever show up, or should I attempt to send again? [20:16:38] 10Operations, 10MassMessage, 10WMF-JobQueue: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Pchelolo) I don't think they will show up neither will resending help at this point, we need to fix the underlying issue first. 
[20:44:05] (03CR) 10Ayounsi: Homer deploy repo init (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [20:49:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:50:33] (03PS3) 10Andrew Bogott: Make puppetmaster CA content key be a hash keyed by puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/533758 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [20:50:39] !log bootstrapping Cassandra, restbase-dev1004-b -- T224554 [20:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:42] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [20:52:58] (03PS1) 10Ottomata: Increase default EventGate max_body_size to 10mb [deployment-charts] - 10https://gerrit.wikimedia.org/r/535286 (https://phabricator.wikimedia.org/T232362) [20:53:07] (03CR) 10Andrew Bogott: [C: 03+2] Make puppetmaster CA content key be a hash keyed by puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/533758 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [20:55:45] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Patch-For-Review: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Pchelolo) Ok, we've dug out the root cause of this. In the job queue system the maximum size of the serialized job is 4 mb, so the maximum body size... 
[20:56:33] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Pchelolo) [20:56:46] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Ottomata) It looks like this would have been a problem before the migration to EventGate too.... [21:00:04] Reedy and sbassett: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T2100). [21:11:16] (03PS5) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) [21:11:18] (03PS4) 10CDanis: dbctl: add set-note instance subcommand [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) [21:14:30] (03CR) 10Alaa Sarhan: [C: 03+1] mediawiki: Add rebuildItemTerms for Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [21:15:28] (03PS6) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) [21:15:30] (03PS5) 10CDanis: dbctl: add set-note instance subcommand [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) [21:16:15] (03CR) 10CDanis: dbctl: add set-candidate-master subcommand on instance (034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [21:17:33] (03CR) 10CDanis: dbctl: add set-candidate-master 
subcommand on instance (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [21:18:19] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/534899 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [21:23:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:25:18] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10cscott) Might be related to {T232390}. [21:26:57] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [21:27:07] (03PS1) 10Andrew Bogott: Revert "Make puppetmaster CA content key be a hash keyed by puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/535293 [21:30:01] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [21:32:01] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Make puppetmaster CA content key be a hash keyed by puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/535293 (owner: 10Andrew Bogott) [21:39:02] (03PS1) 10CDanis: some reminders for doing the next release [software/conftool] - 10https://gerrit.wikimedia.org/r/535296 [21:44:28] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [21:48:09] PROBLEM - Check for VMs leaked by the 
nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [21:48:25] andrewbogott, ^ [21:48:30] that'll probably be us [21:48:46] yeah, I'll ack [21:49:35] The blog isn't us though is it? [21:50:47] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [21:51:08] no I hope not xD [21:57:41] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project Jhedden Impacted by labs puppetmaster maintenance https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [22:05:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:08:27] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:09:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:14:48] (03CR) 10BryanDavis: [C: 03+1] "one comment nit inline" (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [22:17:15] (03PS5) 10Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) [22:17:49] (03CR) 10Bstorm: 
"Thanks for catching that!" (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [22:19:37] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:21:07] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:21:43] (03CR) 10Bstorm: [C: 03+2] sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [22:26:11] (03PS6) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [22:26:13] (03PS4) 10Andrew Bogott: cloud: Move instances to use new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530371 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [22:26:15] (03PS1) 10Andrew Bogott: Make puppetmaster CA content key be a hash keyed by puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/535305 (https://phabricator.wikimedia.org/T171188) [22:30:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10media-storage: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (10CDanis) a:05CDanis→03fgiunchedi Leaving some notes here before I'm gone for two weeks. * The script as it exists on `ms-fe1005:/srv/swiftrepl` matches git... [22:31:43] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) >>! 
In T224554#5470427, @Dzahn wrote: > @Eevans I recreated the certs for restbase-dev1004 through restb... [22:32:45] (03CR) 10Andrew Bogott: [C: 03+2] Make puppetmaster CA content key be a hash keyed by puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/535305 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [22:39:29] (03CR) 10Subramanya Sastry: "Okay to get this deployed in a swat window then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [22:43:37] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10CDanis) >>! In T231086#5447327, @CDanis wrote: > On each Swift frontend host, I: > * grepped today's logs for GETs that resulted in... [22:48:11] 08Warning Alert for device cr1-eqiad.wikimedia.org - Memory over 85% [22:56:17] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:58:41] ACKNOWLEDGEMENT - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed Ayounsi https://phabricator.wikimedia.org/T232412 https://phabricator.wikimedia.org/tag/wikimedia-blog/ [22:59:22] Downtimed the blog alert for a week [22:59:27] and opened https://phabricator.wikimedia.org/T232412 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190909T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:02:54] o/ [23:02:56] 10Operations, 10Wikimedia-Mailing-lists: Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10Jrbranaa) Thanks @jbond. Appreciated. [23:04:05] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [23:04:35] RoanKattouw: are you free for swat? only person i can see available in the room :) [23:04:43] Yeah I can do it [23:05:22] +2ed, waiting for Jenkins now [23:11:03] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10CDanis) @BBlack @ema can you weigh in at some point soon with how feasible this seems? I'm pretty unfamiliar with the current setup h... [23:15:37] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10BBlack) @ema would know better about how difficult such things are with ATS in particular. I tend not to like this idea in general, t... [23:19:38] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10CDanis) Yeah, all fair points. We don't seem to be experiencing too many of these 404s (a handful per day), and other mitigations are... 
[23:20:02] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10CDanis) 05Open→03Declined [23:20:05] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10CDanis) [23:36:36] (03CR) 10Andrew Bogott: [C: 03+2] cloud: Move instances to use new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530371 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:37:40] Jdlrobson: OK looks like it finally merged a few minutes ago, will be on mwdebug1002 shortly [23:37:53] (03PS5) 10Andrew Bogott: cloud: Move instances to use new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530371 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:37:55] (03PS7) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [23:39:24] (03CR) 10Cwhite: [C: 03+1] swift: port alerts to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/535182 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [23:40:14] (03CR) 10Cwhite: [C: 03+1] grafana: use Prometheus swift metrics for dashboard [puppet] - 10https://gerrit.wikimedia.org/r/535180 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [23:40:31] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group Ayounsi https://phabricator.wikimedia.org/T231192 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:43:38] 10Operations, 10Wikimedia-Mailing-lists: mass AOL bounces on mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) [23:44:20] !log catrope@deploy1001 Synchronized
php-1.34.0-wmf.21/skins/MinervaNeue/: T232260 (duration: 00m 57s) [23:44:21] ACKNOWLEDGEMENT - Check systemd state on eventlog1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi https://phabricator.wikimedia.org/T232349 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:25] T232260: Hamburger and notifications menu icons arehidden in mobile view of RTL languages in Chrome - https://phabricator.wikimedia.org/T232260 [23:44:59] 10Operations, 10Wikimedia-Mailing-lists: mass AOL bounces on mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) [23:45:13] 10Operations, 10Wikimedia-Mailing-lists: mass AOL bounces on mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) [23:46:12] 10Operations, 10Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) [23:47:06] ACKNOWLEDGEMENT - Check systemd state on netboxdb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov Workin on it. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:51] RECOVERY - Check systemd state on netboxdb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:01] PROBLEM - netbox Postgres on netboxdb2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB netbox (host:localhost) 27241368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:51:16] expected :) [23:55:36] ACKNOWLEDGEMENT - netbox Postgres on netboxdb2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB netbox (host:localhost) 27248744 and 0 seconds Cas Rusnov Fixing replication in progress. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring