[01:33:11] (03Abandoned) 10Ladsgroup: mediawiki: Disable query page updates for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/542915 (https://phabricator.wikimedia.org/T234948) (owner: 10Ladsgroup) [01:37:47] PROBLEM - Memory correctable errors -EDAC- on mw1252 is CRITICAL: 4 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1252&var-datasource=eqiad+prometheus/ops [01:47:16] 10Operations, 10Analytics: stat1005 cron spam from prometheus-amd-rocm-stat - https://phabricator.wikimedia.org/T236004 (10jijiki) [01:47:32] 10Operations, 10Analytics: stat1005 cron spam from prometheus-amd-rocm-stat - https://phabricator.wikimedia.org/T236004 (10jijiki) a:05elukey→03None [02:27:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:43] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:27:43] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:27:47] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:27:47] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:28:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:28:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:29:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:29:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:29:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:29:21] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:29:21] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:29:23] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:29:23] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [02:29:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:30:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:40:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25088968 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:42:15] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 54632 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:38:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:38:43] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:38:43] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload 
site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:40:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:21] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:40:21] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:53:44] (03PS1) 10Vgutierrez: hiera: Set nginx on port 4443 for cache upload on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/544597 (https://phabricator.wikimedia.org/T231433) [04:53:46] (03PS1) 10Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/544598 (https://phabricator.wikimedia.org/T231433) [04:57:44] !log Switch cp5006 from nginx to ats-tls - T231433 [04:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:48] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:59:00] (03CR) 10Vgutierrez: [C: 03+2] "pcc shows the expected NOOP on cp[5001-5005]: https://puppet-compiler.wmflabs.org/compiler1001/18934/" [puppet] - 10https://gerrit.wikimedia.org/r/544597 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:04:59] (03CR) 10Vgutierrez: [C: 03+2] "pcc shows the expected changes on cp5006 and a NOOP for cp[5001-5005]: https://puppet-compiler.wmflabs.org/compiler1001/18935/" [puppet] - 10https://gerrit.wikimedia.org/r/544598 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:06:47] PROBLEM - HTTPS Unified ECDSA on cp5006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:06:51] PROBLEM - HTTPS Unified RSA on cp5006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:07:07] ^^ expected [05:09:38] !log Deploy schema change on s7 primary master db1062 - T234066 T233135 [05:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:43] T234066: Schema 
change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [05:09:43] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [05:10:01] RECOVERY - HTTPS Unified ECDSA on cp5006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345529 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:10:05] RECOVERY - HTTPS Unified RSA on cp5006 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345534 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:13:28] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [05:13:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312 for schema change and pool db1129 temporarily in vslow, dump', diff saved to https://phabricator.wikimedia.org/P9395 and previous config saved to /var/cache/conftool/dbconfig/20191021-051356-marostegui.json [05:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:28] !log Deploy schema change on db1090:3312 T234066 T233135 [05:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:55] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1105 rebooted itself - https://phabricator.wikimedia.org/T235877 (10Marostegui) The data comparison finished correctly (still no HW) logs. I am going to give this host some weight to help out db1099:3311 so it doesn't get super cold. @Cmjohnson let us know whi... 
[05:17:43] (03PS1) 10Marostegui: db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/544603 (https://phabricator.wikimedia.org/T235877) [05:17:58] (03PS1) 10Vgutierrez: hiera: Set nginx on port 4443 for cache upload on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/544604 (https://phabricator.wikimedia.org/T231433) [05:18:00] (03PS1) 10Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/544605 (https://phabricator.wikimedia.org/T231433) [05:18:35] (03CR) 10Marostegui: [C: 03+2] db1105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/544603 (https://phabricator.wikimedia.org/T235877) (owner: 10Marostegui) [05:19:28] !log Switch cp4026 from nginx to ats-tls - T231433 [05:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:32] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [05:20:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P9396 and previous config saved to /var/cache/conftool/dbconfig/20191021-052035-marostegui.json [05:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:17] (03CR) 10Vgutierrez: [C: 03+2] "pcc shows a NOOP for cp[4021-4025] and shows the expected changes on cp4026: https://puppet-compiler.wmflabs.org/compiler1001/18936/" [puppet] - 10https://gerrit.wikimedia.org/r/544604 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:25:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2088:3312 db2084:3315 - T235599', diff saved to https://phabricator.wikimedia.org/P9397 and previous config saved to /var/cache/conftool/dbconfig/20191021-052527-marostegui.json [05:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:32] T235599: Recompress special slaves across eqiad and codfw - 
https://phabricator.wikimedia.org/T235599 [05:25:49] PROBLEM - HTTPS Unified ECDSA on cp4026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:25:57] ^^ expected [05:26:33] PROBLEM - HTTPS Unified RSA on cp4026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P9398 and previous config saved to /var/cache/conftool/dbconfig/20191021-052643-marostegui.json [05:26:48] (03PS2) 10Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/544605 (https://phabricator.wikimedia.org/T231433) [05:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:31] (03CR) 10Vgutierrez: [C: 03+2] "pcc shows a NOOP on cp[4021-4025] and the expected changes on cp4026: https://puppet-compiler.wmflabs.org/compiler1001/18938/" [puppet] - 10https://gerrit.wikimedia.org/r/544605 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:28:05] (03PS3) 10Vgutierrez: hiera: Set ats-tls on port 443 for cache upload nodes on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/544605 (https://phabricator.wikimedia.org/T231433) [05:28:22] !log Compress tables on db2084:3314 db2091:3312 - T235599 [05:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:30:35] (03PS1) 10Giuseppe Lavagetto: lvs::monitor_services: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/544610 [05:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:30:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:48] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2053.codfw.wmnet` - db2053.codfw.wmnet (**PASS**) - Downtimed host on Ic... [05:31:18] (03PS1) 10Marostegui: site.pp: Remove db2053 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/544611 (https://phabricator.wikimedia.org/T231407) [05:31:25] RECOVERY - HTTPS Unified ECDSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345565 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:31:38] <_joe_> is CI working at all? [05:31:51] _joe_: It has worked for me a few minutes ago [05:31:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs::monitor_services: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/544610 (owner: 10Giuseppe Lavagetto) [05:31:56] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for db2053 [dns] - 10https://gerrit.wikimedia.org/r/544612 (https://phabricator.wikimedia.org/T231407) [05:32:03] <_joe_> ok I'll wait then [05:32:07] yup... 
same here [05:32:13] RECOVERY - HTTPS Unified RSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345517 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:32:23] merged 4 changes without needing to V:+2 yet [05:32:35] <_joe_> it's always frustrating when CI takes less than 1 minute to run on your computer, but takes several in production [05:32:41] <_joe_> should go the other way around :P [05:32:41] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db2053 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/544611 (https://phabricator.wikimedia.org/T231407) (owner: 10Marostegui) [05:32:52] _joe_: 1 minute for me :) [05:33:05] <_joe_> which, considering you changed one file [05:33:13] <_joe_> is about 40 seconds too much [05:33:17] haha [05:33:20] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for db2053 [dns] - 10https://gerrit.wikimedia.org/r/544612 (https://phabricator.wikimedia.org/T231407) (owner: 10Marostegui) [05:33:51] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) a:05RobH→03Papaul [05:34:08] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) Host ready for on-site steps and switch disablement [05:35:06] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [05:35:50] (03CR) 10Marostegui: [C: 03+2] Add MachineVision tables/columns to filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/544233 (https://phabricator.wikimedia.org/T235887) (owner: 10Mholloway) [05:35:55] 10Operations, 10Cassandra, 10Core Platform Team Legacy (Later), 10User-Eevans: Upload 3.11.4 
packages to APT repo - https://phabricator.wikimedia.org/T235675 (10Joe) 05Open→03Resolved [05:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1105:3311', diff saved to https://phabricator.wikimedia.org/P9399 and previous config saved to /var/cache/conftool/dbconfig/20191021-053737-marostegui.json [05:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:14] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3044 [puppet] - 10https://gerrit.wikimedia.org/r/544613 (https://phabricator.wikimedia.org/T231433) [05:38:16] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3044 [puppet] - 10https://gerrit.wikimedia.org/r/544614 (https://phabricator.wikimedia.org/T231433) [05:38:44] !log Switch cp3044 from nginx to ats-tls - T231433 [05:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:47] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [05:39:20] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp3044 [puppet] - 10https://gerrit.wikimedia.org/r/544613 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:40:48] (03PS1) 10Marostegui: instances.yaml: Remove db2048,db2061 [puppet] - 10https://gerrit.wikimedia.org/r/544615 (https://phabricator.wikimedia.org/T228258) [05:41:44] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2048,db2061 [puppet] - 10https://gerrit.wikimedia.org/r/544615 (https://phabricator.wikimedia.org/T228258) (owner: 10Marostegui) [05:42:16] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp3044 [puppet] - 10https://gerrit.wikimedia.org/r/544614 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:42:56] <_joe_> !log slowly removing service objects from production etcd T233973 [05:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:43:00] T233973: remove service objects from etcd and update documentation - https://phabricator.wikimedia.org/T233973 [05:43:22] 10Operations, 10Analytics: stat1005 cron spam from prometheus-amd-rocm-stat - https://phabricator.wikimedia.org/T236004 (10elukey) 05Open→03Resolved a:03elukey Thanks a lot for the ping, this is my bad. To debug an issue I updated the rocm-smi package (was left untouched from a previous ROCm update) but t... [05:43:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2048 and db2061, those hosts will be decommissioned T228258', diff saved to https://phabricator.wikimedia.org/P9400 and previous config saved to /var/cache/conftool/dbconfig/20191021-054340-marostegui.json [05:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:45] T228258: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 [05:43:55] PROBLEM - HTTPS Unified RSA on cp3044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:43:59] ^^ expected [05:48:20] 10Operations, 10conftool: remove service objects from etcd and update documentation - https://phabricator.wikimedia.org/T233973 (10Joe) 05Open→03Resolved [05:49:20] RECOVERY - HTTPS Unified RSA on cp3044 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 597204 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 351 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:49:22] 10Operations, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10Joe) 05Open→03Stalled a:05Joe→03None changing to stalled, and vacating assignment.
[05:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1105:3311', diff saved to https://phabricator.wikimedia.org/P9401 and previous config saved to /var/cache/conftool/dbconfig/20191021-055017-marostegui.json [05:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:39] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [05:50:54] 10Operations, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10Joe) a:03Joe [05:54:38] (03PS1) 10Vgutierrez: hiera: Move nginx from port 4443 to 443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544617 (https://phabricator.wikimedia.org/T231433) [05:54:40] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544618 (https://phabricator.wikimedia.org/T231433) [05:54:51] !log Switch cp2017 from nginx to ats-tls - T231433 [05:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:55] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [05:55:41] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544617 (https://phabricator.wikimedia.org/T231433) [05:56:32] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544617 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight 100 to db1130 on s5 to check for slow queries T223151', diff saved to https://phabricator.wikimedia.org/P9402 and previous config saved to /var/cache/conftool/dbconfig/20191021-055843-marostegui.json [05:58:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:48] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [05:59:04] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544618 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [05:59:15] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2017 [puppet] - 10https://gerrit.wikimedia.org/r/544618 (https://phabricator.wikimedia.org/T231433) [06:01:38] PROBLEM - HTTPS Unified ECDSA on cp2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:02:07] ^^ expected [06:02:18] PROBLEM - HTTPS Unified RSA on cp2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:03:02] RECOVERY - HTTPS Unified ECDSA on cp2017 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345577 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:03:31] 10Operations, 10serviceops, 10Kubernetes: New Deployment charts should allow exposing services via TLS - https://phabricator.wikimedia.org/T236008 (10Joe) [06:03:42] RECOVERY - HTTPS Unified RSA on cp2017 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345537 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:07:40] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [06:09:32] (03Abandoned) 10KartikMistry: Enable Compact Language Links by 
default in Beta Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415506 (owner: 10KartikMistry) [06:09:34] (03CR) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (owner: 10Giuseppe Lavagetto) [06:10:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:10:27] (03PS1) 10Marostegui: site.pp: Remove puppet references for dbproxy1006 [puppet] - 10https://gerrit.wikimedia.org/r/544619 (https://phabricator.wikimedia.org/T233207) [06:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:34] (03PS1) 10Marostegui: wmnet: Remove production dns entries dbproxy1006 [dns] - 10https://gerrit.wikimedia.org/r/544620 (https://phabricator.wikimedia.org/T233207) [06:11:57] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove puppet references for dbproxy1006 [puppet] - 10https://gerrit.wikimedia.org/r/544619 (https://phabricator.wikimedia.org/T233207) (owner: 10Marostegui) [06:12:18] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production dns entries dbproxy1006 [dns] - 10https://gerrit.wikimedia.org/r/544620 (https://phabricator.wikimedia.org/T233207) (owner: 10Marostegui) [06:12:19] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1086 [puppet] - 10https://gerrit.wikimedia.org/r/544621 (https://phabricator.wikimedia.org/T231433) [06:12:22] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1086 [puppet] - 10https://gerrit.wikimedia.org/r/544622 (https://phabricator.wikimedia.org/T231433) [06:12:24] !log Switch cp1086 from nginx to ats-tls - T231433 [06:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:27] T231433: Move cache upload cluster from nginx to ats-tls - 
https://phabricator.wikimedia.org/T231433 [06:13:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Marostegui) a:05Marostegui→03Cmjohnson [06:13:22] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1086 [puppet] - 10https://gerrit.wikimedia.org/r/544621 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [06:13:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Marostegui) Host ready for on-site steps and switch disablement [06:14:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Marostegui) a:05Cmjohnson→03Jclark-ctr [06:15:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P9403 and previous config saved to /var/cache/conftool/dbconfig/20191021-061518-marostegui.json [06:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:04] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp1086 [puppet] - 10https://gerrit.wikimedia.org/r/544622 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [06:16:13] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1086 [puppet] - 10https://gerrit.wikimedia.org/r/544622 (https://phabricator.wikimedia.org/T231433) [06:16:31] (03PS3) 10Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) [06:17:52] PROBLEM - HTTPS Unified ECDSA on cp1086 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:18:00] PROBLEM - HTTPS Unified RSA on cp1086 is CRITICAL: SSL CRITICAL - failed to 
connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:18:06] ^^ expected [06:19:56] RECOVERY - HTTPS Unified ECDSA on cp1086 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345589 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:20:00] RECOVERY - HTTPS Unified RSA on cp1086 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345583 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:22:03] <_joe_> marostegui: are you changing something in mediawiki-config in etcd right now? [06:23:36] <_joe_> yes you are [06:24:12] _joe_: yes, I was a few minutes ago [06:24:32] <_joe_> there are bogus warnings about etcdconfig [06:24:46] <_joe_> I think volans needs to review how the check works [06:24:57] dbctl config diff is clean [06:25:04] <_joe_> yeah don't worry [06:25:18] <_joe_> the alerts are for etcd being fresher than expected :P [06:26:04] (03PS2) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [06:26:06] (03PS1) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [06:26:35] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [06:28:35] !log Install python3-cryptography-2.6.1-3+deb10u2 on acme-chief hosts - T234131 [06:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:39] T234131: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131
[06:33:18] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) >>! In T234131#5587004, @MoritzMuehlenhoff wrote: > @Vgutierrez I created a 2.6.1-3+deb10u2, it's in my home on acmechief1001. Let's deploy this on acmechief* hosts on Monday befor... [06:45:32] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3030 [puppet] - 10https://gerrit.wikimedia.org/r/544635 (https://phabricator.wikimedia.org/T231627) [06:45:34] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3030 [puppet] - 10https://gerrit.wikimedia.org/r/544636 (https://phabricator.wikimedia.org/T231627) [06:46:21] !log Switch from nginx to ats-tls on cp3030 - T231627 [06:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:26] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [06:46:57] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp3030 [puppet] - 10https://gerrit.wikimedia.org/r/544635 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:50:07] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp3030 [puppet] - 10https://gerrit.wikimedia.org/r/544636 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:52:18] PROBLEM - HTTPS Unified RSA on cp3030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:52:39] ^^ expected [06:53:54] RECOVERY - HTTPS Unified RSA on cp3030 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 593330 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 351 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:56:49] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - 
https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [06:59:47] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/544648 (https://phabricator.wikimedia.org/T231627) [06:59:50] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/544649 (https://phabricator.wikimedia.org/T231627) [06:59:56] !log Switch from nginx to ats-tls on cp2001 - T231627 [07:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:00] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [07:00:47] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/544648 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [07:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on s1 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9404 and previous config saved to /var/cache/conftool/dbconfig/20191021-070119-marostegui.json [07:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:24] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [07:03:02] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/544649 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [07:03:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights from 1 to 100 on s1 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9405 and previous config saved to /var/cache/conftool/dbconfig/20191021-070352-marostegui.json [07:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:26] PROBLEM - HTTPS Unified ECDSA on cp2001 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [07:05:55] ^^ expected [07:06:42] RECOVERY - HTTPS Unified ECDSA on cp2001 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345559 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool non partitioned db1089 into s1 special slaves to check for slow queries T223151', diff saved to https://phabricator.wikimedia.org/P9406 and previous config saved to /var/cache/conftool/dbconfig/20191021-070655-marostegui.json [07:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:00] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [07:10:42] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [07:13:51] (03PS2) 10Muehlenhoff: Remove late-install hack for puppet 4 installation [puppet] - 10https://gerrit.wikimedia.org/r/543816 (https://phabricator.wikimedia.org/T228657) [07:14:13] <_joe_> marostegui: I keep seeing alerts (this time, worse ones) about etcdconfig in icinga [07:14:20] <_joe_> and they take time to go away [07:14:25] <_joe_> I think it's a fault of our checks [07:14:52] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/544655 (https://phabricator.wikimedia.org/T231627) [07:14:58] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/544656 (https://phabricator.wikimedia.org/T231627) [07:15:12] !log Switch from nginx to ats-tls on cp1075 - T231627 [07:15:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:16] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [07:15:57] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/544655 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [07:17:15] (03PS3) 10Muehlenhoff: Remove late-install hack for puppet 4 installation [puppet] - 10https://gerrit.wikimedia.org/r/543816 (https://phabricator.wikimedia.org/T228657) [07:18:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove late-install hack for puppet 4 installation [puppet] - 10https://gerrit.wikimedia.org/r/543816 (https://phabricator.wikimedia.org/T228657) (owner: 10Muehlenhoff) [07:19:14] (03CR) 10Marostegui: "> @marostegui i can already connect from the new host without this" [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [07:20:41] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/544656 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [07:20:54] PROBLEM - HTTPS Unified RSA on cp1075 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [07:20:58] ^^ expected [07:21:26] PROBLEM - HTTPS Unified ECDSA on cp1075 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [07:23:48] RECOVERY - HTTPS Unified RSA on cp1075 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345552 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:24:22] RECOVERY - HTTPS Unified ECDSA on cp1075 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 
345518 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 32 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:27:22] (03PS1) 10Marostegui: report_users.sh: Remove dbproxy1006 [software] - 10https://gerrit.wikimedia.org/r/544659 (https://phabricator.wikimedia.org/T233207) [07:28:01] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Remove dbproxy1006 [software] - 10https://gerrit.wikimedia.org/r/544659 (https://phabricator.wikimedia.org/T233207) (owner: 10Marostegui) [07:30:41] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [07:30:55] (03CR) 10Mathew.onipe: query_service: prepare query_service for reusbility (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [07:32:08] !log depool cp4029 and reimage as text_ats T227432 [07:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:12] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [07:32:38] (03CR) 10Ema: [C: 03+2] cache: reimage cp4029 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/544181 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:32:52] (03PS2) 10Ema: cache: reimage cp4029 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/544181 (https://phabricator.wikimedia.org/T227432) [07:35:15] !log installing openjdk-11 security updates [07:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:16] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4029.ulsfo.wmnet'] ` The log can be 
found in `/var/log/wm... [07:40:19] (03CR) 10Muehlenhoff: [C: 03+1] mariadb/ferm_misc: allow moscovium to connect to rt database [puppet] - 10https://gerrit.wikimedia.org/r/544079 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [07:41:44] (03PS3) 10Jcrespo: [WIP]bacula:Add 1st version of backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [07:41:46] (03PS1) 10Jcrespo: backup: Migrate bacula director from helium to backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) [07:42:06] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/544666 (https://phabricator.wikimedia.org/T231433) [07:42:08] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/544667 (https://phabricator.wikimedia.org/T231433) [07:42:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP]bacula:Add 1st version of backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [07:42:44] (03CR) 10Jcrespo: "First version." 
[puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [07:43:02] !log Switch from nginx to ats-tls on cp3045 - T231627 [07:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:06] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [07:44:13] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/544666 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [07:45:49] !log installing aspell security updates on jessie [07:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:21] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp3045 [puppet] - 10https://gerrit.wikimedia.org/r/544667 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [07:48:31] PROBLEM - HTTPS Unified ECDSA on cp3045 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [07:48:49] ^^ expected [07:49:35] RECOVERY - HTTPS Unified ECDSA on cp3045 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 557230 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 351 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:50:18] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4029.ulsfo.wmnet,service=ats-be [07:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:34] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [07:56:44] (03PS1) 10Ema: wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) 
[07:56:54] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3046 [puppet] - 10https://gerrit.wikimedia.org/r/544673 (https://phabricator.wikimedia.org/T231433) [07:56:56] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3046 [puppet] - 10https://gerrit.wikimedia.org/r/544674 (https://phabricator.wikimedia.org/T231433) [07:57:00] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [07:57:02] !log Switch from nginx to ats-tls on cp3046 - T231627 [07:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:05] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [07:57:52] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp3046 [puppet] - 10https://gerrit.wikimedia.org/r/544673 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [07:59:03] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:17] _joe_: were they transient warning? re:etcd freshness on icinga [07:59:22] jouncebot: now [07:59:22] No deployments scheduled for the next 2 hour(s) and 30 minute(s) [07:59:26] jouncebot: next [07:59:26] In 2 hour(s) and 30 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1030) [07:59:42] <_joe_> volans: both warnings and then alerts [07:59:49] <_joe_> but they stayed red for like 8 minutes [08:00:16] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp3046 [puppet] - 10https://gerrit.wikimedia.org/r/544674 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:00:30] _joe_: at which time? 
I cannot see them here in my backlog [08:00:44] <_joe_> not made it to icinga apparently [08:00:47] <_joe_> err to irc [08:00:53] <_joe_> I am looking at icinga [08:01:13] ok, I'll look there. There is an expected race given that we cache the value on icinga side to avoid to do hundreds of calls to etcd [08:01:53] so it's possible that few checks see that mw hosts are ahead, but should be very transient [08:02:19] <_joe_> yeah that's what perplexed me [08:02:42] the cache is refreshed every 30s [08:02:52] I'll check the logs [08:02:52] thanks for the ping [08:03:52] !log swift codfw-prod: final weight to ms-be205[1-6] - T233638 [08:03:52] also: greetings [08:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:52] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [08:03:52] May I deploy a security patch now? [08:04:03] (03PS7) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [08:09:30] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [08:13:12] _joe_: ofc your comment here are at 6:24 UTC... hence I only have the after logs :( [08:13:20] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4029.ulsfo.wmnet'] ` and were **ALL** successful. 
[08:13:28] <_joe_> volans: sigh :P [08:14:14] from which it seems that it's all normal, recovery are HARD and warning are only SOFT, there are few critical SOFT because of hosts that were caught right after a change in the master that didn't yet catch up [08:14:19] all expected from what I can see [08:14:41] <_joe_> yeah it was more the time to recovery that puzzled me [08:14:56] ah wait, we duplicate logs, so I can have them, give me a sec [08:15:23] 10Operations, 10serviceops, 10Kubernetes, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10Joe) [08:16:57] _joe_: I suspect the longer time is because there were many commits (12 in total, some within few minutes from each other) [08:17:31] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [08:18:44] (03PS1) 10Ema: wdqs: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/544770 (https://phabricator.wikimedia.org/T210411) [08:19:03] <_joe_> volans: no I dont' think that's the case [08:19:09] <_joe_> I think check lags or something like that [08:19:19] <_joe_> anyways, we will see next time it happens I guess [08:19:53] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2020 [puppet] - 10https://gerrit.wikimedia.org/r/544771 (https://phabricator.wikimedia.org/T231433) [08:19:57] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2020 [puppet] - 10https://gerrit.wikimedia.org/r/544772 (https://phabricator.wikimedia.org/T231433) [08:20:05] so far I can only see "newer" warnings, then the update of the cache and than OK recoveries, and the update of the cache is always within 30s of the first "newer" warning [08:20:28] !log Switch from nginx to ats-tls on cp2020 - T231627 [08:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:32] T231627: Move cache text cluster from nginx 
to ats-tls - https://phabricator.wikimedia.org/T231627 [08:20:38] so yeah let's see if it happen again [08:20:49] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2020 [puppet] - 10https://gerrit.wikimedia.org/r/544771 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:22:06] (03PS1) 10Ema: secret: dummy key for wdqs [labs/private] - 10https://gerrit.wikimedia.org/r/544773 (https://phabricator.wikimedia.org/T210411) [08:23:44] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2020 [puppet] - 10https://gerrit.wikimedia.org/r/544772 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:25:11] PROBLEM - HTTPS Unified ECDSA on cp2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [08:25:13] PROBLEM - HTTPS Unified RSA on cp2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [08:25:46] ^^ expected [08:26:27] RECOVERY - HTTPS Unified ECDSA on cp2020 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345571 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:27] RECOVERY - HTTPS Unified RSA on cp2020 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345571 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:33] (03PS8) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [08:26:35] (03PS3) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 
10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [08:26:37] <_joe_> vgutierrez: you could just downtime the hosts ;) [08:26:37] (03PS2) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [08:26:40] (03PS1) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [08:26:51] (03CR) 10jerkins-bot: [V: 04-1] blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [08:27:18] <_joe_> oh great [08:27:28] <_joe_> CI caught a WTF of mine! [08:27:45] (03PS9) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [08:28:48] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) [08:30:36] !log pool cp4029 with ATS backend T227432 [08:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:40] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [08:31:34] (03PS4) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [08:31:36] (03PS3) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [08:31:38] (03PS2) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [08:31:43] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump kafka-logging default 
partitions [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [08:31:49] (03CR) 10jerkins-bot: [V: 04-1] blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [08:31:52] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [08:34:08] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/544780 (https://phabricator.wikimedia.org/T231433) [08:34:10] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/544781 (https://phabricator.wikimedia.org/T231433) [08:34:16] !log Switch from nginx to ats-tls on cp2022 - T231627 [08:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:20] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [08:34:49] !log Deploy security patch (T234862) [08:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:07] (03PS3) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [08:35:19] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2022 [puppet] - 10https://gerrit.wikimedia.org/r/544780 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:35:23] (03CR) 10jerkins-bot: [V: 04-1] blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [08:37:30] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2022 [puppet] - 
10https://gerrit.wikimedia.org/r/544781 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:41:23] (03PS4) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [08:43:42] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [08:43:52] (03PS4) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [08:44:47] (03CR) 10Jbond: "thanks for the quick review" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [08:46:32] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: set template_overwrite true in elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/544209 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [08:49:04] (03PS4) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [08:49:43] (03CR) 10Gehel: [C: 03+1] "We probably also want to deploy envoy on other WDQS roles:" [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:49:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-makedomain: allow transfers on domains owned by admin project [puppet] - 10https://gerrit.wikimedia.org/r/544223 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [08:49:56] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [08:50:18] arturo: merging your change too [08:50:26] godog: ok thanks [08:50:39] np [08:52:39] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - 
https://phabricator.wikimedia.org/T234900 (10jcrespo) In nagios format: ` root@helium:~$ python3 check_bacula.py All failures: 12 (phab1001), Stale: 5 (puppetmaster2001), Stale full only: 1 (cobal... [08:52:47] !log roll-restart logstash to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/544209 [08:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-makedomain: fix help message [puppet] - 10https://gerrit.wikimedia.org/r/544785 [09:00:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-makedomain: fix help message [puppet] - 10https://gerrit.wikimedia.org/r/544785 (owner: 10Arturo Borrero Gonzalez) [09:01:30] !log installing openjpeg2 security updates [09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:32] (03CR) 10Volans: [C: 04-1] "Thanks for starting the migration of puppet-merge to python." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [09:07:09] (03PS5) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:07:52] !log installing jackson-databind security updates [09:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:08] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:09:38] (03CR) 10Muehlenhoff: puppet-merge: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [09:09:48] (03CR) 10Gehel: [C: 04-1] query_service: prepare query_service for reusbility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:10:22] (03PS6) 10Jcrespo: bacula: Create new backup 
jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:11:02] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:11:17] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:11:56] !log installing subversion updates on Stretch (fixes compatibility with security fix for Apache update) [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:31] (03PS7) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:16:51] (03CR) 10Gehel: "Some duplication should be addressed." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:17:58] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:18:16] !log installing php7.0 security updates [09:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:18] (03CR) 10Gehel: [C: 03+1] "Minor comment inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:20:40] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T234774) [09:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:44] T234774: Add Wikidata support for banwiki - https://phabricator.wikimedia.org/T234774 [09:24:12] moritzm: hi, and I have filled a task about potentially getting 
rid of subversion for Phabricator https://phabricator.wikimedia.org/T236026 [09:24:21] I don't think we still have any use for it [09:27:20] it's mostly installed on researcher hosts, probably so that they can access SVN repos, no idea [09:27:39] (03Abandoned) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246 (owner: 10Filippo Giunchedi) [09:28:54] (03CR) 10Volans: "LGTM, just few minor comments inline. I guess you plan to add this, deploy it and test it randomly on a bunch of hosts before adding the N" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [09:31:28] (03CR) 10Filippo Giunchedi: [C: 03+2] site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [09:31:35] (03PS8) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:31:42] (03PS15) 10Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) [09:33:15] (03CR) 10Muehlenhoff: "Thanks for the review! I'll address comments later the day. 
I was in fact planning to first merge the script and test the various combinat" [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [09:33:41] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:34:05] hashar: I'll remove PHP 7.0 from deploy* servers, it's still present there [09:35:29] !log removing PHP 7.0 from deployment servers [09:35:30] moritzm: I would hope that it has never been used or at least everything pointed to /usr/bin/php which is now 7.2 :] [09:35:31] good luck! [09:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:02] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10jijiki) @Joe should we Resolve this? [09:37:30] (03PS2) 10Mathew.onipe: wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:37:32] (03PS1) 10Mathew.onipe: wdqs: envoy TLS termination for other clusters [puppet] - 10https://gerrit.wikimedia.org/r/544829 (https://phabricator.wikimedia.org/T210411) [09:41:33] (03PS9) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:44:36] (03PS3) 10Ema: wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) [09:47:32] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:47:35] (03CR) 10Ema: [V: 
03+2 C: 03+2] secret: dummy key for wdqs [labs/private] - 10https://gerrit.wikimedia.org/r/544773 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:48:02] (03CR) 10Ema: [C: 03+2] wdqs: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/544770 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:50:47] (03PS1) 10Giuseppe Lavagetto: systemd: remove references to hhvm in the tests [puppet] - 10https://gerrit.wikimedia.org/r/544844 [09:51:26] (03PS4) 10Ema: wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) [09:51:57] !log maintenance script is done [09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:31] (03PS1) 10Filippo Giunchedi: swift: use resurce for swiftrepl tidy [puppet] - 10https://gerrit.wikimedia.org/r/544845 (https://phabricator.wikimedia.org/T162123) [09:54:09] (03CR) 10Ema: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18941/" [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:54:20] PROBLEM - DPKG on mw2232 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:30] PROBLEM - DPKG on mw2224 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:32] PROBLEM - DPKG on mw2219 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:33] PROBLEM - DPKG on mw2274 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:36] PROBLEM - DPKG on mw2204 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:38] PROBLEM - DPKG on mw2157 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg 
[09:54:38] PROBLEM - DPKG on mw2234 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:38] PROBLEM - DPKG on mw2137 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:40] PROBLEM - DPKG on mw2207 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:46] PROBLEM - DPKG on mw2221 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:46] PROBLEM - DPKG on mw2229 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:46] PROBLEM - DPKG on mw2217 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:46] PROBLEM - DPKG on mw2210 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:52] PROBLEM - DPKG on mw2144 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:52] PROBLEM - DPKG on mw2200 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:52] PROBLEM - DPKG on mw2216 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:52] PROBLEM - DPKG on mw2208 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:52] PROBLEM - DPKG on mw2203 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:56] PROBLEM - DPKG on mw2143 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:58] PROBLEM - DPKG on mw2236 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:58] PROBLEM - DPKG on mw2233 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:54:58] PROBLEM - DPKG on mw2241 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:00] PROBLEM - DPKG on mw2215 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:00] somebody unplugged the wrong thing ^ [09:55:02] PROBLEM - DPKG on mw2230 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:02] PROBLEM - DPKG on mw2220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:02] PROBLEM - DPKG on mw2142 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:04] PROBLEM - DPKG on mw2227 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:04] PROBLEM - DPKG on mw2212 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:06] PROBLEM - DPKG on mw2163 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:08] PROBLEM - DPKG on mw2235 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:08] PROBLEM - DPKG on mw2228 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:10] PROBLEM - DPKG on mw2138 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:10] PROBLEM - DPKG on mw2161 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:10] PROBLEM - DPKG 
on mw2140 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:12] PROBLEM - DPKG on mw2225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:12] PROBLEM - DPKG on mw2223 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:12] PROBLEM - DPKG on mw2155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:20] PROBLEM - DPKG on mw2136 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:20] PROBLEM - DPKG on mw2156 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:24] PROBLEM - DPKG on mw2218 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:24] PROBLEM - DPKG on mw2147 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:24] PROBLEM - DPKG on mw2226 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:24] PROBLEM - DPKG on mw2152 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:28] PROBLEM - DPKG on mw2209 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:30] PROBLEM - DPKG on mw2154 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:30] PROBLEM - DPKG on mw2206 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:32] PROBLEM - DPKG on mw2222 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg 
[09:55:32] PROBLEM - DPKG on mw2139 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:32] PROBLEM - DPKG on mw2211 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:32] PROBLEM - DPKG on mw2214 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:35] (03PS1) 10Effie Mouzeli: spec: remove hhvm references from tests [puppet] - 10https://gerrit.wikimedia.org/r/544847 (https://phabricator.wikimedia.org/T229792) [09:55:36] PROBLEM - DPKG on mw2146 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:36] PROBLEM - DPKG on mw2158 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:36] PROBLEM - DPKG on mw2202 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:36] PROBLEM - DPKG on mw2159 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:37] <_joe_> effie: ^^ any idea why? 
[09:55:40] PROBLEM - DPKG on mw2141 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:42] PROBLEM - DPKG on mw2205 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:46] PROBLEM - DPKG on mw2145 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:46] PROBLEM - DPKG on mw2160 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:51] _joe_: I didn't push anything [09:55:52] PROBLEM - DPKG on mw2201 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:56] PROBLEM - DPKG on mw2237 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:56] PROBLEM - DPKG on mw2135 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:55:56] PROBLEM - DPKG on mw2162 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:56:06] PROBLEM - DPKG on mw2153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:56:08] but let's find out [09:56:18] Amir1: Hi. I see you ran the Wikidata script but I was not able to add Wikidata items on banwiki yet. Does it take a while? [09:56:26] <_joe_> effie: I think you're uninstalling hhvm? 
[09:56:26] PROBLEM - DPKG on mw2267 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:56:35] <_joe_> if so, that's kinda-expected [09:56:35] _joe_: no [09:56:39] no no [09:56:40] no, I'm installing the 7.0 update, looking [09:56:40] hauskater: yes, as I mentioned in the ticket, it's going to take two hours [09:56:46] I have not pushed that package yet [09:56:49] patch* [09:56:55] Amir1: ah, didn't see the ticket. Well I could add one already. Thanks! [09:57:19] <_joe_> moritzm: argh why the 7.0 update? [09:57:26] <_joe_> we should remove php7.0 if we can [09:57:34] _joe_: we will reimage anyway [09:57:43] so removing is only necessary [09:57:47] in a handful of hosts [09:57:48] <_joe_> effie: ok ok I was trying to figure out what was going on [09:58:04] sure sure [09:58:31] I think you said blubberoid too many times already today [09:58:39] and production is reacting to this [09:58:46] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01015 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:59:06] RECOVERY - DPKG on mw2201 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:00:28] <_joe_> moritzm: the installation seems stuck on the postinst of php7.0-cli [10:00:39] yeah, I'm on it, it [10:00:57] it's a similar issue as with PHP 7.2, triggering a conffile prompt [10:01:05] (03CR) 10Jbond: "> Patch Set 4: Code-Review-1" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [10:02:04] RECOVERY - DPKG on mw2202 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:02:28] (03PS5) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [10:02:56] RECOVERY - DPKG on mw2203 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:02:59] mhhh a whole lot of changes for zuul to process, 
looks like things are stuck [10:03:10] PROBLEM - DPKG on mw2281 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:03:27] cc hashar ^ known ? [10:03:32] RECOVERY - DPKG on mw2209 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:03:36] RECOVERY - DPKG on mw2206 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:03:46] RECOVERY - DPKG on mw2205 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:16] RECOVERY - DPKG on mw2219 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:18] RECOVERY - DPKG on mw2204 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:20] RECOVERY - DPKG on mw2234 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:24] RECOVERY - DPKG on mw2207 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:25] godog: I'd say yes [10:04:28] RECOVERY - DPKG on mw2210 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:28] RECOVERY - DPKG on mw2217 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:28] RECOVERY - DPKG on mw2229 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:28] RECOVERY - DPKG on mw2221 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:34] RECOVERY - DPKG on mw2216 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:34] RECOVERY - DPKG on mw2208 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:40] RECOVERY - DPKG on mw2236 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:42] RECOVERY - DPKG on mw2233 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:44] RECOVERY - DPKG on mw2215 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:44] RECOVERY - DPKG on mw2230 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:46] RECOVERY - DPKG on mw2220 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:48] RECOVERY - DPKG on mw2227 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:48] RECOVERY - DPKG on mw2212 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:52] RECOVERY - DPKG on mw2228 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:52] RECOVERY - DPKG on mw2235 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:54] RECOVERY - DPKG on mw2225 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:04:54] RECOVERY - DPKG on mw2223 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:03] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "CI/zuul currently in trouble, forcing verified as change is trivial" [puppet] - 10https://gerrit.wikimedia.org/r/544845 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [10:05:08] RECOVERY - DPKG on mw2218 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:08] RECOVERY - DPKG on mw2147 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:08] RECOVERY - DPKG on mw2226 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:16] RECOVERY - DPKG on mw2222 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:16] RECOVERY - DPKG on mw2211 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:18] RECOVERY - DPKG on mw2146 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:26] RECOVERY - DPKG on mw2141 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:27] 
Daimona: thank you [10:05:30] RECOVERY - DPKG on mw2145 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:40] RECOVERY - DPKG on mw2135 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:40] RECOVERY - DPKG on mw2237 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:40] RECOVERY - DPKG on mw2232 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:40] RECOVERY - DPKG on mw2162 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:50] RECOVERY - DPKG on mw2153 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:05:50] RECOVERY - DPKG on mw2224 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:00] RECOVERY - DPKG on mw2157 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:00] RECOVERY - DPKG on mw2137 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:12] RECOVERY - DPKG on mw2144 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:13] PROBLEM - DPKG on mw2273 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:16] RECOVERY - DPKG on mw2143 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:24] RECOVERY - DPKG on mw2142 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:24] PROBLEM - DPKG on mw2247 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:28] RECOVERY - DPKG on mw2163 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:30] RECOVERY - DPKG on mw2138 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:30] RECOVERY - DPKG on mw2161 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:30] 
RECOVERY - DPKG on mw2140 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:36] (03PS5) 10Mathew.onipe: wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:06:38] (03PS2) 10Mathew.onipe: wdqs: envoy TLS termination for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/544829 (https://phabricator.wikimedia.org/T210411) [10:06:42] RECOVERY - DPKG on mw2136 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:42] RECOVERY - DPKG on mw2156 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:46] RECOVERY - DPKG on mw2152 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:48] PROBLEM - DPKG on mw2278 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:50] RECOVERY - DPKG on mw2154 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:50] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:06:56] RECOVERY - DPKG on mw2158 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:32] PROBLEM - DPKG on mw2263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:07:40] PROBLEM - DPKG on mw2264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:07:42] PROBLEM - DPKG on mw2258 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:07:52] PROBLEM - DPKG on mw2272 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:07:52] PROBLEM - DPKG on mw2254 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:07:52] PROBLEM - DPKG on mw2270 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:08:02] PROBLEM - DPKG on mw2259 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:08:12] PROBLEM - DPKG on mw2266 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:08:30] PROBLEM - DPKG on mw2280 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:08:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544853 (https://phabricator.wikimedia.org/T128546) [10:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:12] RECOVERY - DPKG on mw2274 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:20] RECOVERY - DPKG on mw2264 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:28] RECOVERY - DPKG on mw2267 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:32] RECOVERY - DPKG on mw2272 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:32] RECOVERY - DPKG on mw2273 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:32] RECOVERY - DPKG on mw2270 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:44] RECOVERY - DPKG on mw2281 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:09:50] RECOVERY - DPKG on mw2266 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:10:06] RECOVERY - DPKG on mw2278 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:10:08] RECOVERY - DPKG on 
mw2280 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:10:50] RECOVERY - DPKG on mw2263 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:11:18] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:30] the ms-fe failures are known, that's me [10:12:48] (03PS1) 10Awight: Put reference previews back into beta mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544855 (https://phabricator.wikimedia.org/T233813) [10:13:33] !log CI in trouble due to a huge number of changes [10:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:40] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [10:17:56] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [10:19:05] (03CR) 10jerkins-bot: [V: 04-1] spec: remove hhvm references from tests [puppet] - 10https://gerrit.wikimedia.org/r/544847 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:19:42] !log contint1001 / contint2001 : marking integration/config zuul merger repo readonly: sudo chown -R root:root /srv/zuul/git/integration/config [10:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:52] (03PS1) 10Ema: Add wdqs-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/544856 (https://phabricator.wikimedia.org/T210411) [10:27:34] (03CR) 10Ema: [C: 03+2] wdqs: TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/544672 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:28:45] (03PS6) 10Jbond: puppet-merge: refactor [puppet] - 
10https://gerrit.wikimedia.org/r/544214 [10:29:00] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:52] awight: If you want I can just sling out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/544855 now. [10:29:58] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1030). [10:30:17] Oops, a minute too late. ;-) [10:33:07] James_F: I won't be too long :) [10:33:16] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544853 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:35] jan_drewniak: do you think you can update site-stats.json for me? I was unable to. [10:33:42] I filed a task [10:33:48] gulp hates me apparently [10:33:50] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544853 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:28] T235965 was it [10:34:28] T235965: Update Module:Project portal/views.json - https://phabricator.wikimedia.org/T235965 [10:35:42] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:04] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops [10:38:03] James_F: Thanks, that would be helpful! No worries if I missed the window for that offer. [10:38:16] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:544853| Bumping portals to master (T128546)]] (duration: 01m 00s) [10:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:20] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:38:51] hauskater: hey, yeah I can update site-stats.json. I wish I had time to update the Gulp stuff in the portals repo. [10:39:16] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:544853| Bumping portals to master (T128546)]] (duration: 01m 00s) [10:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:35] jan_drewniak: much thanks. I swear I was unable to run gulp. 
No idea what's wrong :) [10:42:07] hauskater: yeah, it's the miracle of modern build toolchains: touch nothing and it'll burst into flames in 3 years 🤦‍♂️ [10:42:24] awight: I'm done [10:43:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] host monitoring: add optional contact group for mgmt interfaces (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [10:45:05] 10Operations, 10serviceops: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (10jijiki) p:05Triage→03Normal a:03jijiki [10:45:14] 10Operations, 10serviceops: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (10jijiki) [10:45:19] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [10:54:42] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:00] (03PS1) 10Jbond: decommission rhodium: move rhodium into the spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/544860 (https://phabricator.wikimedia.org/T235503) [10:55:43] (03CR) 10jerkins-bot: [V: 04-1] decommission rhodium: move rhodium into the spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/544860 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [10:57:09] jouncebot: next [10:57:09] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1100) [10:59:08] (03PS1) 10Filippo Giunchedi: swiftrepl: ensure system user and service runs as 'swiftrepl' [puppet] - 10https://gerrit.wikimedia.org/r/544863 (https://phabricator.wikimedia.org/T162123) [10:59:51] (03PS2) 10KartikMistry: Enable CX out of beta in Malayalam/Bengali/Mongolian WPs [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/543764 (https://phabricator.wikimedia.org/T233008) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1100). [11:00:04] mobrovac, kart_, and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:22] here. [11:00:37] ah right, it's swat time [11:00:45] o/ [11:00:48] mobrovac: go with your patch and ping me. [11:01:09] kk [11:01:11] * mobrovac swatting [11:01:21] (03PS2) 10Jbond: decommission rhodium: move rhodium into the spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/544860 (https://phabricator.wikimedia.org/T235503) [11:01:34] (03CR) 10Filippo Giunchedi: [C: 03+2] swiftrepl: ensure system user and service runs as 'swiftrepl' [puppet] - 10https://gerrit.wikimedia.org/r/544863 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [11:01:34] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) [11:01:52] (03PS2) 10Filippo Giunchedi: swiftrepl: ensure system user and service runs as 'swiftrepl' [puppet] - 10https://gerrit.wikimedia.org/r/544863 (https://phabricator.wikimedia.org/T162123) [11:02:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544860 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [11:04:00] (03CR) 10Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18947/console" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [11:04:30] (03CR) 10Jbond: [C: 03+2] decommission rhodium: move rhodium into the spare::system role 
[puppet] - 10https://gerrit.wikimedia.org/r/544860 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [11:04:42] (03PS1) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 [11:07:29] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (owner: 10Effie Mouzeli) [11:07:36] PROBLEM - DPKG on mw2192 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:07:40] PROBLEM - DPKG on mw2174 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:07:42] PROBLEM - DPKG on mw2191 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:07:48] PROBLEM - DPKG on mw2193 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:07:54] PROBLEM - DPKG on mw2189 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:07:54] PROBLEM - DPKG on mw2173 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:08] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) [11:08:10] PROBLEM - DPKG on mw2179 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:16] PROBLEM - DPKG on mw2188 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:20] PROBLEM - DPKG on mw2197 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:28] PROBLEM - DPKG on mw2257 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:28] PROBLEM - DPKG on mw2194 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:30] PROBLEM - DPKG on mw2243 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:32] PROBLEM - DPKG on mw2239 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:36] PROBLEM - DPKG on mw2196 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:40] PROBLEM - DPKG on mw2249 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:40] PROBLEM - DPKG on mw2246 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:40] PROBLEM - DPKG on mw2184 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:46] PROBLEM - DPKG on mw2165 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:48] PROBLEM - DPKG on mw2256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:50] PROBLEM - DPKG on mw2198 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:50] PROBLEM - DPKG on mw2199 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:52] PROBLEM - DPKG on mw2252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:08:56] what? 
[11:09:04] PROBLEM - DPKG on mw2248 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:04] PROBLEM - DPKG on mw2190 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:04] PROBLEM - DPKG on mw2195 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:12] PROBLEM - DPKG on mw2242 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:12] PROBLEM - DPKG on mw2240 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:14] PROBLEM - DPKG on mw2251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:16] PROBLEM - DPKG on mw2255 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:18] PROBLEM - DPKG on mw2238 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:20] PROBLEM - DPKG on mw2290 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:32] PROBLEM - DPKG on mw2253 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:10:02] Is this going to spam for all the codfw mw hosts? [11:10:20] I don't know. What's broken? [11:11:47] Looks to be php7.0-related from `dpkg -l | grep -v "^ii"` [11:12:38] i think moritzm was looking at this, it alerted earlier as well [11:12:43] Ah, Ok.
[11:13:03] yeah, maybe a few slipped through and now downtime expired, looking into it [11:13:04] yeah I just found it in the backscroll [11:13:42] RECOVERY - DPKG on mw2247 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:13:52] RECOVERY - DPKG on mw2253 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:13:56] RECOVERY - DPKG on mw2257 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:06] RECOVERY - DPKG on mw2198 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:06] RECOVERY - DPKG on mw2199 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:06] RECOVERY - DPKG on mw2256 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:08] RECOVERY - DPKG on mw2252 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:20] RECOVERY - DPKG on mw2258 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:22] RECOVERY - DPKG on mw2248 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:22] RECOVERY - DPKG on mw2190 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:32] RECOVERY - DPKG on mw2192 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:32] RECOVERY - DPKG on mw2242 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:32] RECOVERY - DPKG on mw2240 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:32] RECOVERY - DPKG on mw2254 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:33] RECOVERY - DPKG on mw2251 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:36] RECOVERY - DPKG on mw2255 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:38] RECOVERY - DPKG on mw2174 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:40] RECOVERY - DPKG on mw2241 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:40] RECOVERY - DPKG on mw2259 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:40] RECOVERY - DPKG on mw2290 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:55] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission [11:14:56] RECOVERY - DPKG on mw2173 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:00] RECOVERY - DPKG on mw2155 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:15:06] RECOVERY - DPKG on mw2179 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:15:35] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [11:15:36] RECOVERY - DPKG on mw2200 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] 10Operations, 10DC-Ops, 10decommission: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `rhodium.eqiad.wmnet` - rhodium.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Downtimed management... 
[11:15:46] RECOVERY - DPKG on mw2238 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:17:05] 10Operations, 10DC-Ops, 10decommission: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) [11:17:46] RECOVERY - DPKG on mw2249 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:18:19] (03PS2) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [11:19:51] 10Operations, 10DC-Ops, 10decommission: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10MoritzMuehlenhoff) [11:20:13] (03PS1) 10Jbond: decomission rhodium: completely remove rhodium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/544865 (https://phabricator.wikimedia.org/T235503) [11:20:54] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:21:36] RECOVERY - DPKG on mw2197 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:21:36] RECOVERY - DPKG on mw2246 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:21:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544865 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [11:22:54] RECOVERY - DPKG on mw2191 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:22:56] RECOVERY - DPKG on mw2159 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:23:24] RECOVERY - DPKG on mw2165 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:23:37] (03CR) 10Jbond: [C: 03+2] decomission rhodium: completely remove rhodium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/544865 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) 
[11:23:42] (03PS1) 10Jbond: decomission rhodium: remove rhodium DNS entries [dns] - 10https://gerrit.wikimedia.org/r/544867 (https://phabricator.wikimedia.org/T235503) [11:24:23] mobrovac: deploying? [11:24:40] RECOVERY - DPKG on mw2188 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:24:40] RECOVERY - DPKG on mw2184 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:25:48] RECOVERY - DPKG on mw2189 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:25:54] !log mobrovac@deploy1001 Synchronized php-1.35.0-wmf.2/includes/Storage/SqlBlobStore.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 1/3 - T235188 (duration: 01m 00s) [11:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:58] kart_: sync started now [11:25:58] T235188: Some revisions' contents are incorrect in the cache - wrong contents shown in history & diffs - https://phabricator.wikimedia.org/T235188 [11:26:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM as first iteration, see inline for nits" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [11:26:11] got two more files and then i'm done [11:26:25] sorry, was waiting on jenkins for a long time this being a core change [11:26:28] RECOVERY - DPKG on mw2195 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:26:56] RECOVERY - DPKG on mw2214 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:02] RECOVERY - DPKG on mw2193 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:17] mobrovac: Jenkins is done :) [11:27:22] RECOVERY - DPKG on mw2139 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:22] RECOVERY - DPKG on mw2160 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:22] RECOVERY - DPKG on mw2243 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:22] RECOVERY - DPKG on mw2196 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:22] RECOVERY - DPKG on mw2239 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:22] RECOVERY - DPKG on mw2194 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:27:36] (03CR) 10Muehlenhoff: [C: 03+1] decomission rhodium: remove rhodium DNS entries [dns] - 10https://gerrit.wikimedia.org/r/544867 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [11:27:40] kart_: i know, as i said, in the process of syncing [11:27:51] OK OK. I saw your message late :) [11:28:05] !log mobrovac@deploy1001 Synchronized php-1.35.0-wmf.2/includes/libs/objectcache/wancache/WANObjectCache.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 2/3 - T235188 (duration: 00m 59s) [11:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:34] mobrovac: Thanks. Should I go ahead with my config change? [11:30:33] kart_: wait a sec more, file 3/3 syncing now [11:30:36] !log mobrovac@deploy1001 Synchronized php-1.35.0-wmf.2/tests/phpunit/includes/Storage/SqlBlobStoreTest.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 3/3 - T235188 (duration: 01m 00s) [11:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:44] kart_: ok done, all yours :) [11:30:53] Thanks! [11:31:12] I should've looked at the number of files in the patch. Note to self for the future! [11:31:14] :) [11:31:36] (03CR) 10KartikMistry: [C: 03+2] "SWAT."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/543764 (https://phabricator.wikimedia.org/T233008) (owner: 10KartikMistry) [11:32:29] (03Merged) 10jenkins-bot: Enable CX out of beta in Malayalam/Bengali/Mongolian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543764 (https://phabricator.wikimedia.org/T233008) (owner: 10KartikMistry) [11:33:37] (03PS2) 10Awight: Put reference previews back into beta mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544855 (https://phabricator.wikimedia.org/T233813) [11:33:39] (03PS10) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:34:18] !log installing Java security updates on restbase-dev1004 [11:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:39] (03PS11) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:37:03] (03CR) 10Jcrespo: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [11:37:39] (03CR) 10Jbond: [C: 03+2] decomission rhodium: remove rhodium DNS entries [dns] - 10https://gerrit.wikimedia.org/r/544867 (https://phabricator.wikimedia.org/T235503) (owner: 10Jbond) [11:37:41] (03PS7) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [11:38:00] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|543764|Enable ContentTranslation out of Beta in Malayalam/Bengali/Mongolian WPs (T233008, T233009, T234317)]] (duration: 01m 00s) [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:08] T234317: Enable Content Translation in Mongolian Wikipedia as a default tool - https://phabricator.wikimedia.org/T234317 [11:38:08] T233008: Enable Content Translation in Malayalam Wikipedia as a default tool - https://phabricator.wikimedia.org/T233008 [11:38:08] T233009: Enable Content Translation in Bengali Wikipedia as a default tool - https://phabricator.wikimedia.org/T233009 [11:38:57] kart_: Is this a good time for me to do some deployment? [11:39:27] awight: go ahead. I'm done. [11:39:32] Thanks! 
[11:39:35] (03CR) 10Awight: [C: 03+2] Put reference previews back into beta mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544855 (https://phabricator.wikimedia.org/T233813) (owner: 10Awight) [11:40:05] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) a:03Jclark-ctr [11:40:29] (03Merged) 10jenkins-bot: Put reference previews back into beta mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544855 (https://phabricator.wikimedia.org/T233813) (owner: 10Awight) [11:40:57] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10jbond) @Jclark-ctr i believe everything is done from my side but please let me know if i missed anything [11:42:23] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:544855|Put reference previews back into beta mode on beta cluster (T233813)]] (duration: 01m 00s) [11:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:27] T233813: Separate user preferences for Page previews and Reference previews - https://phabricator.wikimedia.org/T233813 [11:42:37] !log EU SWAT complete [11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:22] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005806 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:43:35] (03PS12) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:45:38] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) 
[11:48:08] (03CR) 10Jbond: "Ricardo's comments aside LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [11:48:37] (03PS13) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:49:59] !log Reopen EU SWAT [11:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:36] (03PS2) 10Urbanecm: wgCopyUploadDomains: Add iip.bu.uni.wroc.pl there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544400 (https://phabricator.wikimedia.org/T235904) (owner: 10Zoranzoki21) [11:50:58] (03CR) 10Urbanecm: [C: 03+2] wgCopyUploadDomains: Add iip.bu.uni.wroc.pl there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544400 (https://phabricator.wikimedia.org/T235904) (owner: 10Zoranzoki21) [11:51:14] (03PS14) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [11:51:36] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544343 (https://phabricator.wikimedia.org/T235343) (owner: 10Jayprakash12345) [11:51:44] (03PS2) 10Urbanecm: Create Portal namespace for sawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544343 (https://phabricator.wikimedia.org/T235343) (owner: 10Jayprakash12345) [11:51:48] (03Merged) 10jenkins-bot: wgCopyUploadDomains: Add iip.bu.uni.wroc.pl there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544400 (https://phabricator.wikimedia.org/T235904) (owner: 10Zoranzoki21) [11:52:09] (03PS3) 10Urbanecm: Create Portal namespace for sawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544343 (https://phabricator.wikimedia.org/T235343) (owner: 10Jayprakash12345) [11:52:18] (03CR) 10Urbanecm: [C: 03+2] Create Portal namespace for sawikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/544343 (https://phabricator.wikimedia.org/T235343) (owner: 10Jayprakash12345) [11:53:06] (03Merged) 10jenkins-bot: Create Portal namespace for sawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544343 (https://phabricator.wikimedia.org/T235343) (owner: 10Jayprakash12345) [11:54:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 3b1350b: wgCopyUploadDomains: Add iip.bu.uni.wroc.pl there (T235904) (duration: 00m 59s) [11:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:24] T235904: add http://iip.bu.uni.wroc.pl to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T235904 [11:55:56] !log urbanecm@deploy1001 sync-file aborted: SWAT: 12e3549: Create Portal namespace for sawikisource (duration: 00m 01s) [11:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 12e3549: Create Portal namespace for sawikisource (T235343) (duration: 00m 59s) [11:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:10] T235343: Create Portal namespace for sawikisource - https://phabricator.wikimedia.org/T235343 [11:58:00] (03PS7) 10Urbanecm: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [11:58:48] (03CR) 10Urbanecm: [C: 03+2] Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [11:59:46] (03Merged) 10jenkins-bot: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [12:00:31] !log I'm going to do one last sync for EU SWAT [12:00:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e8d70c1: Partial cleanup of InitialiseSettings (T231178) (duration: 01m 00s) [12:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:59] T231178: General cleanup of initialize settings - https://phabricator.wikimedia.org/T231178 [12:02:08] (03PS8) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [12:02:25] !log EU SWAT finally done [12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:59] (03CR) 10Jbond: puppet-merge: refactor (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [12:10:35] (03CR) 10Giuseppe Lavagetto: "I like the approach but I have a general comment on the puppet code, that I'd like to see a bit more structured for readability." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [12:11:41] (03PS1) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) [12:12:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] systemd: remove references to hhvm in the tests [puppet] - 10https://gerrit.wikimedia.org/r/544844 (owner: 10Giuseppe Lavagetto) [12:12:32] (03CR) 10jerkins-bot: [V: 04-1] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [12:12:34] (03CR) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [12:13:10] (03PS4) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 
10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [12:14:23] (03PS1) 10Volans: homer: add netbox credentials to the configuration [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) [12:14:41] (03CR) 10jerkins-bot: [V: 04-1] homer: add netbox credentials to the configuration [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:15:45] (03PS1) 10Jon Harald Søby: Add Balinese to interwiki sort orders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544882 (https://phabricator.wikimedia.org/T234768) [12:16:15] !log Stopped zuul-merger on contint1001 [12:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] scaffold: Add option for TLS termination (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto) [12:17:33] (03PS1) 10Giuseppe Lavagetto: envoy: update to 1.11.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/544884 [12:18:00] (03PS2) 10Effie Mouzeli: spec: remove hhvm references from tests [puppet] - 10https://gerrit.wikimedia.org/r/544847 (https://phabricator.wikimedia.org/T229792) [12:18:41] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:18:42] (03PS2) 10Volans: homer: add netbox credentials to the configuration [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) [12:19:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] scaffold: Add option for TLS termination (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto) 
[12:19:22] (03PS3) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [12:20:08] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18949/" [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [12:20:31] (03CR) 10Effie Mouzeli: [V: 03+1] "On a first look it seems ok https://puppet-compiler.wmflabs.org/compiler1002/18948/" [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:21:13] (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:21:43] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:26:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] scaffold: only expose one port as a service by default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [12:27:14] (03PS4) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [12:27:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [12:27:56] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [140.0] amusso too many changes https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:27:56] 
ACKNOWLEDGEMENT - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger amusso too many changes https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:28:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:29:36] (03PS2) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) [12:31:54] !log Started zuul-merger on contint1001 [12:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:04] !log Stopped zuul-merger on contint2001 [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:03] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:35:29] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:37:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add wdqs-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/544856 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:38:29] !log Started zuul-merger on contint2001 [12:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:39:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] 
envoy: update to 1.11.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/544884 (owner: 10Giuseppe Lavagetto) [12:41:34] (03CR) 10Volans: "Looks good, reply inline and a nit." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [12:52:51] (03CR) 10Ema: [C: 03+2] Add wdqs-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/544856 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:57:47] 10Operations, 10Discovery-Search, 10Elasticsearch: Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Mathew.onipe) Issue still persist: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+unassigned+shard+check+-+9243 [12:58:17] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Mathew.onipe) [12:58:31] !log lvs2006: restart pybal to add new service wdqs-ssl T210411 [12:58:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:35] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [12:59:15] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.32:443]) https://wikitech.wikimedia.org/wiki/PyBal [12:59:25] this should clear soon ^ [13:00:15] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wdqs-ssl [13:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! 
A first round of comments" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:02:45] !log lvs1016: restart pybal to add new service wdqs-ssl T210411 [13:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:52] (03CR) 10Alexandros Kosiaris: backup: Migrate bacula director from helium to backup1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [13:03:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1090:3312 after schema change and remove db1129 from vslow and dump as it was was there temporarily', diff saved to https://phabricator.wikimedia.org/P9409 and previous config saved to /var/cache/conftool/dbconfig/20191021-130355-marostegui.json [13:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:04] !log lvs2003: restart pybal to add new service wdqs-ssl T210411 [13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:08] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [13:04:47] !log Deploy schema change on db1122 (s2 primary master) - T233135 T234066 [13:04:49] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:52] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [13:04:52] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [13:07:29] !log lvs1015: restart pybal to add new service wdqs-ssl T210411 [13:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:27] (03CR) 10Muehlenhoff: [C: 03+1] "With the hiera.yaml changes merged 
separately, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [13:10:50] (03PS1) 10Ema: ATS: use TLS to connect to WDQS [puppet] - 10https://gerrit.wikimedia.org/r/544904 (https://phabricator.wikimedia.org/T210411) [13:16:30] * Urbanecm is going to deploy a security patch [13:17:39] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:19:40] (03CR) 10Ema: [C: 03+2] ATS: use TLS to connect to WDQS [puppet] - 10https://gerrit.wikimedia.org/r/544904 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:19:41] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:21:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights 1/2 to 100/200 on s2 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9410 and previous config saved to /var/cache/conftool/dbconfig/20191021-132145-marostegui.json [13:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [13:23:22] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:24:31] (03PS6) 10Jbond: profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [13:24:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights 1/2 to 100/200 on s2 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9411 and 
previous config saved to /var/cache/conftool/dbconfig/20191021-132440-marostegui.json [13:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2084:3314 and db2091:3312 for table compression', diff saved to https://phabricator.wikimedia.org/P9412 and previous config saved to /var/cache/conftool/dbconfig/20191021-132633-marostegui.json [13:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:47] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:50] (03CR) 10Jbond: [C: 03+2] profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [13:29:58] (03CR) 10CDanis: puppet-merge: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [13:30:36] (03PS1) 10Filippo Giunchedi: swift: use systemd::unit for swiftrepl-mw [puppet] - 10https://gerrit.wikimedia.org/r/544911 (https://phabricator.wikimedia.org/T162123) [13:30:38] (03PS1) 10Filippo Giunchedi: swift: use systemd::timer::job for swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) [13:32:29] (03PS2) 10Filippo Giunchedi: swift: use systemd::timer::job for swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) [13:32:47] (03CR) 10jerkins-bot: [V: 04-1] swift: use systemd::timer::job for swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [13:32:57] (03Abandoned) 10Filippo Giunchedi: swift: use systemd::unit for swiftrepl-mw [puppet] - 10https://gerrit.wikimedia.org/r/544911 (https://phabricator.wikimedia.org/T162123) 
(owner: 10Filippo Giunchedi) [13:34:10] (03PS3) 10Filippo Giunchedi: swift: use systemd::timer::job for swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) [13:34:11] <_joe_> hey who's trying to deploy something? [13:34:17] <_joe_> I mean mediawiki [13:34:25] <_joe_> I just got on the console of deploy1001 [13:34:30] <_joe_> php[5295]: PHP Parse error: syntax error, unexpected '&&' (T_BOOLEAN_AND) in /srv/mediawiki-staging/php-1.35.0-wmf.2/extensions/AbuseFilter/includes/Views/AbuseFilterViewDiff.php on line 113 [13:34:51] 10Operations, 10Traffic: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10ema) [13:34:59] 10Operations, 10Traffic: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10ema) p:05Triage→03Normal [13:36:03] (03CR) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [13:36:13] (03PS5) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [13:38:50] (03PS4) 10Filippo Giunchedi: swift: use systemd::timer::job for swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) [13:38:58] (03CR) 10Subramanya Sastry: [C: 03+1] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [13:43:11] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/18954/ms-fe1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/544912 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [13:43:45] (03CR) 10Volans: [C: 
03+1] "LGTM, looking forward to see the corner cases, you can run it via cumin once deployed and see what happens :)" [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [13:44:26] (03PS6) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [13:46:37] !log Deploy sec patch for T104807 [13:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:44] (03CR) 10Muehlenhoff: [C: 03+2] Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [13:46:49] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:36] (03CR) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [13:55:00] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123 (10fgiunchedi) swiftrepl is now running puppetized on both codfw and eqiad and running as a timer once a week per site. Left to do is shipping `swiftrep... 
[13:55:39] (03PS1) 10Jbond: puppet-merge: switch to GitPython [puppet] - 10https://gerrit.wikimedia.org/r/544922 [13:58:09] (03PS9) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [14:01:40] 10Operations, 10observability, 10Performance-Team (Radar): Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10CDanis) [14:09:17] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Gehel) Note that at the moment, eqiad is running with 33 nodes instead of 36, and our sharding decisions are based on a 36 nodes cluster. So reviewing those decisions, a... [14:13:51] (03PS1) 10Jeena Huneidi: all wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544930 [14:13:53] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544930 (owner: 10Jeena Huneidi) [14:14:48] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544930 (owner: 10Jeena Huneidi) [14:15:21] (03CR) 10Filippo Giunchedi: [C: 04-1] "Getting there but IMHO there should be another wrapper define not extending service::monitor" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [14:16:43] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.2 refs T233850 [14:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:47] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [14:21:27] PROBLEM - Check the last execution of swiftrepl-mw on ms-fe2005 is CRITICAL: CRITICAL: Status of the systemd unit swiftrepl-mw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:25:06] (03CR) 10Arturo Borrero Gonzalez: "Since 
all our Debian Buster VMs are using sssd by default now, perhaps it would be good idea to double check that we need Debian Buster do" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (owner: 10Phamhi) [14:26:02] 10Operations: Error 503 and slow loading on multiple wikis (19th Oct 2019 21:28 - 21:36 UTC) - https://phabricator.wikimedia.org/T235949 (10CDanis) Estimate from Logstash is about 32k 50x served over an interval of about six minutes: https://logstash.wikimedia.org/goto/dfa60e42b71fad4c702ab95cbb55db55 One thing... [14:30:26] (03CR) 10Jcrespo: "To not answer every comment one by one, I will change all style suggestions, which are quite easy and small changes. Some additional extra" [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:34:06] (03CR) 10Jcrespo: "Missing response of puppet decision." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:34:08] (03CR) 10Filippo Giunchedi: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [14:34:18] 10Operations: Error 503 and slow loading on multiple wikis (19th Oct 2019 21:28 - 21:36 UTC) - https://phabricator.wikimedia.org/T235949 (10CDanis) Actually, looking at [[ https://turnilo.wikimedia.org/#webrequest_sampled_128/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEa... 
[14:38:47] (03CR) 10Jcrespo: "Plural" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [14:49:20] 10Operations: Error 503 and slow loading on multiple wikis (19th Oct 2019 21:28 - 21:36 UTC) - https://phabricator.wikimedia.org/T235949 (10Marostegui) Spike on the parsercache: https://grafana.wikimedia.org/d/000000273/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1009&var-por... [14:49:30] 10Operations, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, 10User-fgiunchedi: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (10fgiunchedi) 05Open→03Resolved This is effectively done (i.e. swiftrepl is back), following up in {T162123} [14:49:38] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi) [14:50:55] (03PS1) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [14:53:34] (03PS1) 10Muehlenhoff: Update microcode check [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) [14:55:34] (03CR) 10jerkins-bot: [V: 04-1] Update microcode check [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [14:56:11] (03CR) 10Jbond: puppet-merge: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [14:58:06] (03CR) 10Jbond: "@volans I did this CR before the work in https://gerrit.wikimedia.org/r/c/operations/puppet/+/544943" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [14:58:42] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [15:22:49] (03PS1) 10Alexandros Kosiaris: Add reprepo updates for cassandra311 
[puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) [15:24:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [15:28:04] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10greg) >>! In T234209#5567590, @herron wrote: > @greg could you please review/approve for deploy groups? +1 [15:28:35] (03CR) 10Volans: "One comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [15:30:09] (03PS1) 10Alexandros Kosiaris: Stop pinning the cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/544966 (https://phabricator.wikimedia.org/T200803) [15:36:01] (03CR) 10Jhedden: wikimedia.cloud: add initial zone file (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [15:44:20] (03CR) 10Bstorm: "> Patch Set 2:" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (owner: 10Phamhi) [15:45:35] (03CR) 10Jhedden: wikimedia.cloud: add initial zone file (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [15:54:47] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Nuria) Let's see, is kevinbazira a staff member? if so he only needs access to LDAP 'wmf' group. nda is not nee... 
[15:58:31] (03CR) 10Jbond: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [15:58:35] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) 1) reseat the hot swap nic, should reset 2) unplug the ps1, leaving ps2 powered, to reset the nic 3) reset with the reset button, will have to reconfigure the entire pdu (non-ideal, these ha... [15:59:27] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) Mainly I'd like @bblack buy in on a date/time for me to do this work, since option 2 requires #traffic approval imo. (It would cause them work if any of the systems fail.) [16:00:16] (03CR) 10Muehlenhoff: "That was probably added for testing in labs? (Where unattended-upgrades can kick in). But seems sensible for prod." [puppet] - 10https://gerrit.wikimedia.org/r/544966 (https://phabricator.wikimedia.org/T200803) (owner: 10Alexandros Kosiaris) [16:01:53] (03PS15) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [16:03:56] (03CR) 10jerkins-bot: [V: 04-1] bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:10:53] 10Operations, 10ops-esams, 10DC-Ops: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [16:13:21] (03CR) 10Muehlenhoff: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [16:14:18] 10Operations, 10ops-esams, 10DC-Ops: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [16:21:21] 10Operations, 10SRE-Access-Requests, 
10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10RStallman-legalteam) NDA is signed and on file, Fine to proceed to the next steps. Best, Rachel [16:21:59] (03CR) 10Jbond: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [16:25:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10wiki_willy) a:03Cmjohnson [16:26:08] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10wiki_willy) a:03Cmjohnson [16:26:19] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:30:43] (03CR) 10BryanDavis: [C: 04-1] "Several comments inline, mostly related to PHP 7.3" (035 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (owner: 10Phamhi) [16:33:44] (03CR) 10Muehlenhoff: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [16:34:08] 10Operations, 10Pybal, 10Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10ema) >>! In T224570#5471336, @MoritzMuehlenhoff wrote: > More generally speaking: Are the pybal-test* servers still used for testing/developing? Is there a specific reason they are in prod... 
[16:36:36] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) Thank you, this is an interesting point. It reminds me to [[ https://meta.wikimedia.org/wiki/Talk:Wikimedia_Space#Clarification_request | a question... [16:39:07] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:45:54] 10Operations, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10jijiki) Server will be reimaged, I will ping here when... [16:46:23] (03CR) 10Jbond: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [16:49:25] (03CR) 10Muehlenhoff: Update microcode check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544944 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [16:50:38] 10Operations, 10ORES, 10serviceops, 10Patch-For-Review, 10Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) [16:54:03] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10RobH) >>! In T235303#5575619, @RobH wrote: >>>! In T235303#5575089, @Andrew wrote: >> @robh shou... [16:55:48] (03PS3) 10Phamhi: Update all images based on buster. 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [16:57:16] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10RobH) Andrew CC'd. > Hanna, > Normally we email Doneva for this, but her auto-reply advises she... [16:58:32] (03PS16) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [16:59:26] (03CR) 10Jcrespo: "Waiting for your feedback on whether to still remove the module file (easy change)." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [17:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1700). [17:00:30] here here [17:00:55] 10Operations, 10DNS, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (10Andrew) ` Thank you Rob and Hanna! While we're at it, we'd also like the 'wmcloud.org' domain changed in the sam... 
[17:01:32] (03PS1) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544982 (https://phabricator.wikimedia.org/T231178) [17:01:55] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@75c0577]: GUI Updates [17:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:15] (03PS2) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544982 (https://phabricator.wikimedia.org/T231178) [17:05:28] (03PS4) 10Phamhi: docker-images:update all images based on buster. [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [17:08:06] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.1/extensions/VisualEditor/: Update VisualEditor for set of back-ports in wmf.1 T233320, T234564, T235959 (duration: 00m 56s) [17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:13] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [17:08:13] T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array - https://phabricator.wikimedia.org/T234564 [17:08:14] T235959: Visual Editor: deleting selected text not working - https://phabricator.wikimedia.org/T235959 [17:13:32] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@75c0577]: GUI Updates (duration: 11m 37s) [17:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [17:22:41] (03PS5) 10Phamhi: Docker-images: create new docker images based on buster. 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [17:33:47] 10Operations, 10DNS, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (10Andrew) ` Hi Rob and Andrew, The DNS has been updated for the two domains to the two nameservers listed below... [17:33:56] (03PS1) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 [17:33:58] (03PS1) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [17:34:12] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10Andrew) ` Hi Rob and Andrew, The DNS has been updated for the two domains to the two nameser... 
[17:34:48] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [17:35:54] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (owner: 10EBernhardson) [17:42:31] 10Operations, 10DNS, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (10Andrew) 05Open→03Resolved [17:42:41] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10Andrew) 05Open→03Resolved [17:45:39] (03PS1) 10EBernhardson: secret: dummy credentials for airflow [labs/private] - 10https://gerrit.wikimedia.org/r/544993 [17:49:21] (03PS2) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 [17:49:23] (03PS2) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [17:49:55] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (owner: 10EBernhardson) [17:51:05] (03PS16) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [17:51:21] (03PS5) 10Andrew Bogott: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) [17:51:32] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That 
opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1800). [18:00:04] andrewbogott, andrewbogott, and MatmaRex: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] * andrewbogott is here [18:00:46] Quite busy window today :) [18:00:51] I can SWAT today! [18:01:09] (03PS3) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 [18:01:11] (03PS3) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [18:01:29] Urbanecm: the first of my patches is a bit scary because it touches dblists and related things. The second one is trivial (and doesn't touch anything but wikitech) [18:01:38] (03PS6) 10Phamhi: Docker-images: create new docker images based on buster. 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [18:01:42] hi [18:02:12] James already deployed my wmf.1 patches, so only wmf.2 are remaining [18:02:15] Urbanecm: I think it's standard practice anyway, but we should definitely do all the mwdebug tests we can with the first [18:02:29] andrewbogott: if that's possible, that's great [18:02:55] * andrewbogott hasn't changed config recently enough to know what "if that's possible" means [18:03:17] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (owner: 10EBernhardson) [18:03:32] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [18:04:23] Urbanecm: let me know what I should do when :) [18:04:56] andrewbogott: I'm too scared to do the first patch. If you want to do your stuff, feel free to do and let me know once the air is clear for the backports. [18:06:00] So, maybe I don't understand the process. I thought the steps (for config) were 1) merge 2) pull selectively on test boxes 3) full scap [18:06:49] Usually, we want to avoid full scap, because it shakes the cluster a lot. Config is mostly synced using `scap sync-file` [18:08:13] (03PS1) 104nn1l2: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) [18:08:46] andrewbogott: I'm not confident with that patch enough to proceed, so please reschedule to a different window [18:09:45] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Volker_E) @thcipriani Who could help here further? 
[18:09:52] OK — to clarify, are you suggesting that I also do other things in addition to rescheduling? Or just find a different deployer? [18:10:06] (It looks like MatmaRex is also on the list of potential deployers for this window) [18:10:39] i'm definitely not because i don't even have access :D [18:10:40] andrewbogott: yes, I suggest to find someone confident to deploy that. [18:10:47] 'k [18:11:09] (03PS3) 10Dzahn: mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) [18:11:10] thanks andrewbogott [18:11:15] MatmaRex: I've +2'ed your wmf.2 backports [18:11:33] Urbanecm: would tomorrow work or does it need to be a different time of day? [18:11:56] thanks Urbanecm, i'm around to test whenever they go through [18:12:54] ack MatmaRex [18:13:38] andrewbogott: well, it's more about people - I don't know if someone knowing how wikitech works will be around tomorrow :-) [18:15:39] (03CR) 10DannyS712: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:16:03] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:16:14] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [18:16:49] (03CR) 10jerkins-bot: [V: 04-1] Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:16:55] (03PS4) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 
10https://gerrit.wikimedia.org/r/544989 [18:16:57] (03PS4) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [18:16:59] (03PS1) 10EBernhardson: [airflow] Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 [18:17:22] (03CR) 104nn1l2: "Hi. The election starts in three days, so I think this is a top priority. Thanks in advance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:18:27] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:19:07] (03CR) 10Dzahn: [C: 03+1] "shouldn't the puppet compiler show a change on these hosts that are using the mariadb misc role and are shard m2?" [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:19:31] (03CR) 10jerkins-bot: [V: 04-1] [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (owner: 10EBernhardson) [18:23:35] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) Ok, I'm onsite and going to attempt the following on ps1-22-ulsfo: 1) unplug all the data/serial/network connections (leave all power in place) 2) unseat and re-seat the NIC which may power... [18:23:47] (03CR) 10Dzahn: [C: 03+1] "right, the proxy is in between: m2-master.eqiad.wmnet is an alias for dbproxy1007.eqiad.wmnet." 
[puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:24:07] !log working on ps1-22-ulsfo via T235911 (it may flap but it is already ack'd as down in icinga, but not persistent) [18:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:11] T235911: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 [18:26:10] MatmaRex: your patches are ready at mwdebug1001 [18:26:51] Urbanecm: thanks, all of them? please give me a few minutes, i have to test a couple different things [18:26:57] MatmaRex: yes, all of them [18:27:08] sure, let me know once you're done [18:27:25] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.51 ms [18:27:26] (03PS5) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [18:28:35] jouncebot: next [18:28:35] In 0 hour(s) and 31 minute(s): Gerrit server migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1900) [18:28:47] (03PS2) 104nn1l2: Change the language of Votewiki to Persian (fa) temporarily for the annual ArbCom elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) [18:28:59] heads-up: in half an hour Gerrit migration, moving to new server and OS version, expect some downtime if you must merge stuff [18:30:07] !log ps1-22-ulsfo repaired (reseating its NIC rebooted its mgmt interface) Done with it and repeating on ps1-23-ulsfo via T235911 [18:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:12] T235911: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 [18:31:54] (03CR) 104nn1l2: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544995 (https://phabricator.wikimedia.org/T230614) (owner: 104nn1l2) [18:32:23] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet 
loss = 0%, RTA = 75.61 ms [18:32:40] !log ps1-23-ulsfo back online, all pdu work in ulsfo is now complete T235911 [18:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:14] 04Critical Alert for device ps1-22-ulsfo.mgmt.ulsfo.wmnet - Device rebooted [18:33:23] (03PS5) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 [18:33:25] (03PS6) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [18:34:02] Urbanecm: i think the last patch, "Try using structured logging again", is actually not working. can we revert that, and go ahead with the rest? [18:34:52] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) Summary of work: * confirmed in docs that the pro2 will indeed allow hot swap of its network card (the older pro1 will not) * scheduled work with @bblack for #traffic cooperation (no impact... [18:35:02] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) [18:35:19] MatmaRex: sure. Should I revert the wmf.1 version as well? [18:35:39] Urbanecm: yeah, probably. if you have the time [18:35:44] Sure [18:36:01] 10Operations, 10ops-ulsfo, 10Traffic: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (10RobH) 05Open→03Resolved [18:36:12] (it's not really important, it just doesn't log stuff, everything else works. 
and wmf.1 is not deployed anywhere anymore, i think) [18:38:02] good point, missed that :) [18:38:14] 04Critical Alert for device ps1-23-ulsfo.mgmt.ulsfo.wmnet - Device rebooted [18:40:45] MatmaRex: syncing everything but logging patch [18:41:11] right [18:41:29] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.2/extensions/VisualEditor: SWAT: a4ab456: TreeModifier: Ignore removed nodes properly when normalizing from a text node (T235959); ecb4532: Update VE core submodule to a4ab456dc0 (T235959); a850cee: ApiVisualEditor: Always return etag with content (T233320) (duration: 00m 55s) [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:36] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [18:41:36] T235959: Visual Editor: deleting selected text not working - https://phabricator.wikimedia.org/T235959 [18:42:03] MatmaRex: should be done! [18:43:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-22-ulsfo.mgmt.ulsfo.wmnet recovered from Device rebooted [18:43:25] !log Morning SWAT done [18:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:34] MatmaRex: ping me if there's anything else to do [18:43:43] Urbanecm: thank you. everything looks good [18:43:47] great! 
[18:43:49] (03PS17) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [18:43:51] (03PS6) 10Andrew Bogott: labtestwikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 (https://phabricator.wikimedia.org/T229441) [18:43:53] (03PS2) 10Andrew Bogott: wikitech: Update hostnames for OpenStack endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542506 (https://phabricator.wikimedia.org/T223907) (owner: 10BryanDavis) [18:48:00] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) Please let us know what else is needd @RStallman-legalteam [18:48:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-23-ulsfo.mgmt.ulsfo.wmnet recovered from Device rebooted [18:55:04] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [18:55:57] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [18:55:59] mutante: paladox: I am around to assist :D [18:56:28] I am not sure how helpful I can be since you two have managed everything anyway. But I guess an additional pair of eyes might help :D [18:56:34] hashar thanks! [18:57:06] is that new hardware AND jessie > buster AND openjdk-11 all in one go? [18:57:08] cool [18:57:29] hashar it's all new hardware and os but not java 11 yet [18:57:34] great [18:57:35] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) - https://phabricator.wikimedia.org/T227542 (10JHedden) [18:57:36] we need at least gerrit 2.16 for that. 
[18:57:38] hashar: it's jessie > buster, and new hardware with 64 GB [18:57:42] but it's not jdk11 [18:57:44] it's 8 [18:58:41] also earlier today I found out we have a legacy cronjob generating a /var/www/mediawiki-extensions.txt [18:58:59] but the cron is not listed on gerrit1001 [18:59:02] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [18:59:19] arg? [18:59:37] that file is for the mediawiki extension distributor [18:59:51] maybe that is because the puppet class is not applied on gerrit1001 yet [19:00:01] (03CR) 10Ottomata: "I'm think I'm fine with this idea! When I mentioned that search could try and run their own airflow...I was kind of thinking unpuppetized" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544989 (owner: 10EBernhardson) [19:00:03] ah yeah [19:00:04] paladox, mutante, and thcipriani: Your horoscope predicts another unfortunate Gerrit server migration deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1900). [19:00:08] if !$replica { [19:00:08] class { '::gerrit::crons': [19:00:12] only on master. Problem solved :] [19:00:22] yup [19:00:28] hashar: .. i was about to start with "but it has the role" but yes :) [19:00:32] ok! [19:00:52] go! [19:01:17] * thcipriani goes [19:01:21] :) [19:03:16] mutante: okie doke, are you ready to rsync? I can go ahead and stop gerrit if so. [19:03:27] rsync commands expected: [19:03:32] rsync -avp /srv/gerrit/git/ rsync://gerrit1001.wikimedia.org/gerrit-data/git/ [19:03:39] rsync -avp /srv/gerrit/plugins/ rsync://gerrit1001.wikimedia.org/gerrit-data/plugins/ [19:03:46] paladox' task list says: [19:03:57] (also rsync indexes (/var/lib/gerrit2/review_site/indexes), and also rsync lfs objects again (/srv/gerrit/plugins/lfs)). 
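For readers following along, the two rsync invocations quoted above can be assembled from a list of subdirectories. This is only a minimal sketch: the helper name `build_rsync_commands` and the constants are hypothetical, the paths and the `gerrit-data` rsync module are taken from the commands pasted in the channel, and the indexes and lfs paths from the task list are left out because no rsync module for them existed at this point in the log. The sketch only builds the argument lists and does not run anything.

```python
from typing import List

# Values copied from the commands pasted in the channel.
SOURCE_ROOT = "/srv/gerrit"
DEST_MODULE = "rsync://gerrit1001.wikimedia.org/gerrit-data"


def build_rsync_commands(subdirs: List[str]) -> List[List[str]]:
    """Return one rsync argv list per subdirectory (hypothetical helper)."""
    commands = []
    for name in subdirs:
        # Trailing slashes make rsync copy directory contents, as in the log.
        src = f"{SOURCE_ROOT}/{name}/"
        dst = f"{DEST_MODULE}/{name}/"
        commands.append(["rsync", "-avp", src, dst])
    return commands


if __name__ == "__main__":
    for argv in build_rsync_commands(["git", "plugins"]):
        print(" ".join(argv))
```

Keeping the commands as argv lists means the same list can be run once while Gerrit is still up and again after it is stopped, which is the two-pass approach agreed on later in the conversation.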
[19:04:07] this seems new [19:04:09] Yeh [19:04:14] and we are not prepared for it [19:04:20] that's so we doin't have to run the offline indexer [19:04:39] oh [19:05:16] the other piece of that is the library jars [19:05:32] in review_site/lib [19:05:37] I think we rsynced the mysql client [19:05:44] ah, ok [19:05:49] and the javamelody lib would have been scapped over [19:05:58] i copied the mysql client jar [19:05:59] with scp [19:06:02] that was just one file [19:06:05] a while ago [19:06:49] but "rsyncing /var/lib/gerrit2/review_site" is not ready to go [19:08:50] 2.6G worth of indexes, probably not going to be a fast thing either, if this is the first time its been synced. [19:09:14] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10JHedden) [19:09:18] the gerrit $data_dir is /srv/gerrit and that's what is in the code [19:10:27] Is there a way we can edit the rsync file to copy the index over? Or we need to do that through puppet? [19:10:56] it's hackable if we disable puppet..yea [19:11:06] ok [19:11:12] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10RStallman-legalteam) Thanks @Nuria. Working on the NDA now. Can you confirm the exact access set for the NDA - would listing SSH access be clear enough? [19:11:45] mutante lets do that [19:11:55] paladox: there is another problem :( [19:12:01] oh? [19:12:01] yeah, that's /srv/gerrit is the git dirs, but the generated indexes aren't fast to rebuild. [19:12:04] remember how we applied the gerrit::migration role [19:12:09] to setup that rsync stuff [19:12:11] yup [19:12:37] well, right now that is not applied because the regular gerrit role is on it [19:13:17] so it's also ferm [19:13:27] oh [19:13:40] ah, right [19:15:00] I guess we can re apply the clas? 
[19:15:45] changing gerrit::server::data_dir to /var/lib/gerrit2/review_site (we would then need to change it to copy /srv/gerrit/ again) [19:16:19] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (10Peachey88) [19:16:29] best if we apply it in addition [19:16:34] ok [19:16:42] removing the gerrit role will also remove stuff again [19:18:40] so /var/lib/gerrit2/review_site is 7GB. Might take a while to sync initially, even after sorting rsync ferm. [19:19:16] (03PS1) 10Paladox: Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 [19:19:29] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:19:47] mutante thcipriani ^ [19:19:55] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:20:21] (03PS2) 10Paladox: Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 [19:20:26] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:20:55] is that going to complain about duplicate users, directories, etc I wonder? 
[19:21:01] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:21:47] (03PS3) 10Paladox: Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 [19:21:53] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:22:24] (03PS1) 10Dzahn: gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 [19:23:08] (03CR) 10Dzahn: [C: 04-1] "the role keyword can only exist once per node, at least per style guide" [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:23:10] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (owner: 10Dzahn) [19:23:15] paladox: more like this: [19:23:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/545016 [19:23:35] Ah, thanks! [19:23:56] since we add the profile and not the role.. we need to add the Hiera part [19:24:02] in the regular gerrit role [19:24:28] * paladox abandons mine in favour of yours! [19:24:36] (03Abandoned) 10Paladox: Gerrit: Apply gerrit::migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:24:44] eh, yeah, gerrit::jetty seems to create the same users/dirs as profile::gerrit::migration [19:24:49] (03CR) 10Dzahn: "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/545016" [puppet] - 10https://gerrit.wikimedia.org/r/545013 (owner: 10Paladox) [19:25:23] mutante it's probably better to apply the class (removing gerrits), do the rsync and revert. 
[19:25:31] *class [19:25:58] (03PS2) 10Dzahn: gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) [19:26:33] > manifests/site.pp:225 wmf-style: node 'gerrit1001.wikimedia.org' includes class ::profile::gerrit::migration [19:26:40] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [19:26:54] (03CR) 10Paladox: gerrit: add gerrit migration role to gerrit1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [19:26:58] Duplicate declaration: Group[gerrit2] is already declared [19:27:03] yeh [19:27:18] gerrit::jetty seems to create the same users/dirs as profile::gerrit::migration [19:27:42] sigh, ok, let's do the other way then [19:28:06] paladox: ok, let's use revert [19:28:18] ok [19:29:24] (03PS1) 10Dzahn: Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/545019 [19:29:27] this one, ack? 
[19:29:37] that was the newer one, we did that twice [19:29:57] (03PS2) 10Dzahn: Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/545019 [19:30:10] * paladox looks [19:30:23] (03CR) 10Paladox: [C: 03+1] Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/545019 (owner: 10Dzahn) [19:30:25] compiles [19:30:25] +1'd [19:30:36] (03CR) 10Thcipriani: [C: 03+1] Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/545019 (owner: 10Dzahn) [19:31:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18960/" [puppet] - 10https://gerrit.wikimedia.org/r/545019 (owner: 10Dzahn) [19:32:46] running puppet on cobalt (noop expected) and gerrit1001 (setting up rsync ...) [19:33:15] :) [19:33:36] rsync works. pushing data from cobalt to gerrit1001 [19:33:39] (the part we did before) [19:34:05] thanks! (we should stop gerrit so that we doin't rsync then someone pushes a patch) [19:34:27] paladox: i would just run it again [19:34:31] ok [19:34:39] lets do an initial rsync and then re-run after we stop [19:34:49] ^ that, and i just wanted to test it [19:34:59] we will need another change to add the additional pathes [19:36:05] the good part.. it is just a Hiera change now [19:36:24] :) [19:36:32] paladox: can you make one to switch the data_dir from /srv/gerrit/ to /var/lib/.. [19:36:38] yup! [19:36:44] in the gerrit::migration hiera [19:37:20] do we want to do it that way? or add an additional rsync::server::module? we'll want to sync both a final time, right? [19:37:34] yes, you are right. we need to add both [19:37:43] otherwise we would stop gerrit and then how do we merge [19:37:51] paladox: ^ [19:38:05] ok [19:38:09] add a second rsync::server::module [19:38:30] also, won't we need the rsync ferm port open after adding the actual gerrit role to gerrit1001? 
i.e., should I be working on a patch to ensure those roles can live in harmony? [19:38:33] it's kind of "$data_dir2" but no need to make that pretty right now [19:39:20] (03PS1) 10Paladox: Gerrit: Rsync /var/lib/gerrit1 [puppet] - 10https://gerrit.wikimedia.org/r/545021 [19:39:21] ^ [19:39:25] thcipriani mutante ^ [19:39:29] err typo [19:39:49] (03PS2) 10Paladox: Gerrit: Rsync /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/545021 [19:39:54] thcipriani: well..yea, that was the first attempt. i can amend to https://gerrit.wikimedia.org/r/c/operations/puppet/+/545016 [19:39:57] or you can [19:40:31] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Rsync /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/545021 (owner: 10Paladox) [19:40:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) From Dell Support they have not been able to find any hardware errors from tsr report Hi John, I did check... [19:41:03] (03PS3) 10Paladox: Gerrit: Rsync /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/545021 [19:41:22] (03PS4) 10Paladox: Gerrit: Rsync /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/545021 [19:41:28] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545021 (owner: 10Paladox) [19:42:01] thcipriani: let me just remove the group/user part from that .. probably good enough [19:42:39] you could add a config to enable that [19:45:08] paladox: let's do both. quick fix and follow-up ? [19:45:14] ok [19:45:16] (03PS3) 10Dzahn: gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) [19:45:17] +1 [19:45:20] first rsync finished, that was /srv/gerrit/git [19:45:55] starting one for plugins.. takes only seconds. done [19:46:28] thanks! 
[19:47:22] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [19:47:37] manifests/site.pp:226 wmf-style: node 'gerrit1001.wikimedia.org' includes class ::profile::gerrit::migration [19:48:01] (03PS1) 10Thcipriani: gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 [19:48:12] I took a stab at it ^ [19:48:23] let's see how jerkins feels [19:48:53] thcipriani that looks nice! [19:48:54] Duplicate declaration: File[/srv/gerrit] is already declared [19:49:05] almost there [19:49:20] mutante: see https://gerrit.wikimedia.org/r/545024 [19:50:13] passes! [19:51:18] (03CR) 10Paladox: [C: 03+1] gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 (owner: 10Thcipriani) [19:51:38] ack, that is very similar but slightly nicer not to include in site.pp! [19:51:57] (03PS1) 10MaxSem: Disable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545027 (https://phabricator.wikimedia.org/T235949) [19:52:04] but still fails to compile.. [19:52:10] oh [19:52:12] in this case Function lookup() did not find a value for the name 'gerrit::server::data_dir' [19:52:38] * thcipriani moves that to common [19:53:04] also now we cant use the migration role by itself anymore. 
let's add the FIXME for later [19:54:28] * thcipriani does that [19:55:33] (03Abandoned) 10Dzahn: gerrit: add gerrit migration role to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/545016 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [19:56:16] (03CR) 10CDanis: [C: 03+1] Disable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545027 (https://phabricator.wikimedia.org/T235949) (owner: 10MaxSem) [19:56:49] jouncebot: now [19:56:49] For the next 2 hour(s) and 3 minute(s): Gerrit server migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T1900) [19:57:32] (03CR) 10Dzahn: [C: 03+2] Gerrit: Rsync /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/545021 (owner: 10Paladox) [19:57:42] paladox, mutante, thcipriani - am I OK to deply an emergency fix? ^ [19:57:45] (03PS2) 10Thcipriani: gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 [19:59:18] I would think so as we are currently rsyncing. [19:59:37] awsum [19:59:49] MaxSem: not taking it down just yet.. [19:59:49] (03CR) 10jerkins-bot: [V: 04-1] gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 (owner: 10Thcipriani) [20:00:00] (03CR) 10MaxSem: [C: 03+2] Disable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545027 (https://phabricator.wikimedia.org/T235949) (owner: 10MaxSem) [20:00:04] cscott, arlolra, subbu, halfak, and accraze: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T2000). Please do the needful. 
[20:01:08] (03Merged) 10jenkins-bot: Disable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545027 (https://phabricator.wikimedia.org/T235949) (owner: 10MaxSem) [20:01:33] Thcipriani: seems you got something stuck in the yaml :P [20:01:43] the second rsync module has been added [20:01:51] (03PS3) 10Thcipriani: gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 [20:01:53] yeah, redirecting git output fail, fixed ^ [20:01:56] Thanks mutante ! [20:02:08] anyway, puppet compiler is now happy with it https://puppet-compiler.wmflabs.org/compiler1001/18966/ [20:02:19] paladox: so let's confirm what is "/var/lib/gerrit2/review_site to gerrit1001. Also rsync lfs objects again." [20:02:33] Ok [20:02:36] that includs the config dir [20:03:59] The path to the lfs objects should have been syncd [20:04:00] When you did /srv/gerrit [20:04:01] The config path is /var/lib/gerrit2/review_site/etc [20:04:02] thcipriani: \o/ [20:04:34] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18966/" [puppet] - 10https://gerrit.wikimedia.org/r/545024 (owner: 10Thcipriani) [20:04:48] server::master_host is duplicate [20:04:54] but isnt an issue [20:05:09] Ok [20:05:19] (03PS4) 10Dzahn: gerrit: gerrit::migration coexist with ::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/545024 (owner: 10Thcipriani) [20:06:26] paladox: i did not do /srv/gerrit. 
i did /srv/gerrit/git and /srv/gerrit/plugins that's the only things we had listed [20:08:02] !log rsynced /srv/gerrit/git from cobalt to gerrit1001 (T222391) [20:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:07] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [20:08:14] !log rsynced /srv/gerrit/plugins from cobalt to gerrit1001 (T222391) [20:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:36] re-applying gerrit role on gerrit1001 now [20:09:56] thcipriani: ok, puppet looking good. [20:10:00] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/545027/ T235949 (duration: 00m 52s) [20:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:06] mutante: great! [20:10:12] mutante ah ok [20:10:16] so what is the next path [20:10:17] yeh lfs is in plugins/ [20:10:29] mutante /var/lib/gerrit2/review_site [20:10:29] ok! done then [20:12:21] !log rsyncing /var/lib/gerrit2/review_site from cobalt to gerrit1001 (T222391) [20:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:40] paladox: thcipriani: it' [20:12:55] it's running and that includes all the javamelody data right now [20:13:00] great! [20:13:04] thanks!! 
[20:13:48] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@0c6d34b]: Update mobileapps to d6a6e7f [20:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:57] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) @RStallman-legalteam, ssh access and access to private data up to February 15th [20:14:48] paladox: if there is stuff in /srv/gerrit that is not in ./plugins/ or ./git/ then we don't have that [20:14:56] but the rest is done [20:15:09] the only other thing should be jvmlogs [20:15:12] updates ticket check boxes with rsync pathes [20:15:20] though i defer to thcipriani if he wants that data rsync'd [20:15:45] meh, jvmlogs get rotated so regularly, it's fine to leave them [20:15:50] it's just java gc stuff [20:15:55] ok [20:16:15] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:16:51] our steps include shutting down gerrit followed by merging something [20:17:11] oh! [20:17:11] ok, I can stop gerrit [20:17:19] but then how do we merge [20:17:27] ah, right. 
[20:17:50] we should stop it on gerrit1001 [20:17:53] gerrit and puppet [20:17:55] +1 [20:17:59] then merge the ferm-misc change [20:18:04] run puppet on dbproxy1007 [20:18:28] stop puppet on cobalt [20:18:36] merge the gerrit switch [20:18:57] stop gerrit on cobalt [20:19:16] hi [20:19:17] I'm not sure about the dns changes [20:19:22] where they fit [20:19:27] dns change could go anytime i guess [20:19:38] as they would get the error screen on gerrit1001 [20:19:45] (e.g the typical one) [20:19:51] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@0c6d34b]: Update mobileapps to d6a6e7f (duration: 06m 02s) [20:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:01] thcipriani: is the `replication` plugin prepared for the gerrit1001 migration? [20:20:15] yup [20:20:27] gerrit1001 will rsync to gerrit2001 once it's online [20:20:33] I guess it's just changing where it runs because the targets are still gerrit2001 and github [20:20:37] err i mean replicate [20:20:42] indeed. [20:20:56] gerrit-replica runs its own replication system then? [20:21:05] let's get the mariadb thing out of the way [20:21:14] It can, but it shouldn't. [20:21:20] mutante +1 [20:21:21] we know gerrit is currently not running on gerrit1001 [20:21:35] so we can do that and allow the DB access [20:21:36] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:56] schedules downtime for that [20:23:05] thcipriani: ack? i would say next https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535966/ [20:23:59] mutante: +1 with gerrit on gerrit1001 off that seem like an easy one to merge now [20:24:03] (03CR) 10Eevans: "I'm kind of torn here. On the one-hand I am definitely pro-automation and self-service. 
However, we've thus far been pretty guarded about" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [20:25:07] !log gerrit1001 - puppet agent disabled - gerrit service stopped [20:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:19] (03PS4) 10Dzahn: mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) [20:25:22] ack, doing that [20:25:37] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544966 (https://phabricator.wikimedia.org/T200803) (owner: 10Alexandros Kosiaris) [20:26:23] icinga: scheduled 2 hours downtime for everything both on gerrit1001 and cobalt [20:26:43] thanks! [20:27:45] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:28:03] editing the boxes on that ticket to reflect more the order we are doing it [20:28:15] (03CR) 10Dzahn: [C: 03+2] mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [20:29:24] !log running puppet on dbproxy10017 to apply ferm change for gerrit db from gerrit1001 (T222391) [20:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:28] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [20:29:49] I wonder how we will be able to do https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541110/ with gerrit stopped on cobalt? Merge before? 
[20:29:50] and https://gerrit.wikimedia.org/r/#/c/operations/dns/+/541111/ [20:31:06] I think if we stop puppet on cobalt and gerrit2001, then merge the ops/puppet change before stopping gerrit. The DNS one: I don't know, maybe mutante has suggestions for the right OOO. [20:31:22] (order of operations) [20:31:47] ah, +1 thcipriani [20:31:49] merge and immediately stop gerrit on cobalt, I guess [20:32:05] i wanted to test if the db connection works but i did not get mysql client back yet [20:32:19] but I don't know the deploy procedure there, so that may not work [20:32:45] does netcat work? nc -vz [db2-url] -w 1 3306 ? [20:33:24] yes, 3306 (mysql) open [20:33:30] just not testing the grants [20:33:40] but since that is on dbproxy.. it should not change [20:33:42] so all good [20:34:01] cool [20:34:09] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:34:56] ok, so next step would be stop puppet agent on cobalt (and gerrit2001?), and merge the master changes in ops/puppet? 
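The `nc -vz ... -w 1 3306` check suggested above only confirms that the TCP port is reachable; as noted in the channel, it does not test the MySQL grants. A rough Python equivalent, assuming nothing beyond the standard library (the function name `port_open` is hypothetical and the host is whatever database alias is being checked):

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors `nc -vz host -w 1 port`: it checks reachability only and
    says nothing about database credentials or grants.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False
```

It could be called as `port_open(db_host, 3306)`, where `db_host` stands for the proxy alias in question; a `False` result would point at ferm or routing rather than at grants.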
[20:35:53] yup [20:35:59] (and dns) [20:36:04] (03CR) 10Cwhite: [C: 03+1] logstash: config readable by logstash only by default [puppet] - 10https://gerrit.wikimedia.org/r/544218 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [20:36:40] (03PS6) 10EBernhardson: [airflow] Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 [20:36:42] (03PS7) 10EBernhardson: [airflow] Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 [20:37:06] !log disabled puppet on cobalt and gerrit2001 [20:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:41] * thcipriani edits task [20:38:23] thcipriani: i think best we can do is merge it and immediately stop gerrit after it's merged on puppetmaster [20:38:34] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [20:38:35] mutante: how does it get deployed? [20:38:48] thcipriani: running authdns-update on ns0 [20:38:49] I've never actually touched that repo [20:38:52] it's like puppet-merge [20:38:54] but another script [20:39:15] ah, ok, then yeah, merge that, stop gerrit sounds fine [20:39:15] so i can..merge.. then we stop gerrit and then i actually apply [20:39:42] but first we wanted to rsync again [20:40:08] well, not first.. after gerrit service is stopped [20:40:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [20:40:56] I think ^ is right now [20:41:21] (03PS6) 10Dzahn: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541110 (owner: 10Paladox) [20:41:27] just saw, nice. 
doing next [20:42:13] links gerrit changes to our ticket [20:42:13] (03PS7) 10Dzahn: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541110 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [20:42:35] (03CR) 10Dzahn: [C: 03+2] Gerrit: Switch master from cobalt to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541110 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [20:43:25] nice! [20:44:26] and we did not need gerrit::server::host anymore, right paladox [20:44:29] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch (duration: 00m 52s) [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:22] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [20:45:24] mutante yup [20:46:01] we default to gerrit.wikimedia.org in gerrit's common yaml file [20:46:12] (03PS4) 10Dzahn: Switch gerrit.wikimedia.org backend to gerrit1001 [dns] - 10https://gerrit.wikimedia.org/r/541111 (owner: 10Paladox) [20:46:31] yep [20:47:57] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:48:32] (03CR) 10Dzahn: [C: 03+2] Switch gerrit.wikimedia.org backend to gerrit1001 [dns] - 10https://gerrit.wikimedia.org/r/541111 (owner: 10Paladox) [20:48:53] thcipriani: DNS merged but not deployed [20:49:17] mutante: ok, looks like we're ready to shutdown gerrit? 
[20:49:22] yes [20:50:28] alright, here goes [20:50:37] !log stopping gerrit on cobalt [20:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:10] schedules downtime for the actual service IP in icinga [20:51:31] !log rsyncing gerrit-data/git again [20:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:18] !log rsyncing gerrit-data/plugins and /var/lib/gerrit2/review_site/ again [20:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:34] paladox: ^. db files [20:52:50] PROBLEM - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:52:58] mutante thanks! [20:53:10] these alerts will be hard to turn off [20:53:18] but ok [20:53:34] PROBLEM - Check the last execution of git_pull_charts on contint2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:54:13] mutante: let me know when rsyncs are complete, I can move the folders for javamelody [20:54:32] thcipriani: done! do it [20:54:42] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:50] mutante: done [20:55:06] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:55:20] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:45] mutante: I'll go ahead and run puppet on cobalt and contint1001 unless you have the command handy. cobalt first, I'd guess. [20:55:48] PROBLEM - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:48] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:55] thcipriani: confirmed! so we did not add yet when we deploy DNS change [20:56:05] let me ACK some of these because noise [20:56:18] PROBLEM - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:56:32] mutante: if you want to handle the DNS change we can probably do that in parallel with puppet runs [20:56:42] ACKNOWLEDGEMENT - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:42] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:56:42] ACKNOWLEDGEMENT - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:42] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on contint2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:56:42] ACKNOWLEDGEMENT - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:43] ACKNOWLEDGEMENT - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:43] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit migration https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:57:05] thcipriani: ok, tell me when it's running [20:57:21] !log running puppet on cobalt [20:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:33] !log running puppet on gerrit1001 [20:57:35] right? [20:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:28] mutante: yep [20:58:41] running now [20:58:44] deploying DNS change [20:59:12] oh wait.. i see a puppet error [20:59:23] runs it a second time [20:59:25] puppet run done on cobalt. I saw an error about an All-Avatars repo. [20:59:30] yes, me too [20:59:32] being pulled. [20:59:48] it tries to pull from itself [20:59:52] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] Reedy and sbassett: #bothumor I ♥ Unicode.
All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T2100). [21:00:06] * thcipriani headdesk [21:00:20] and there are dependencies on the apache setup.. uhm [21:00:38] hmm [21:00:45] paladox: can we remove this dependency: [21:00:48] PROBLEM - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:00:51] Gerrit::Proxy/File[/etc/apache2/ports.conf]: Dependency Exec[git_pull_All-Avatars] has failures: true [21:00:54] yup [21:01:00] apache2 setup should not depend on that pull working [21:01:23] thcipriani: i guess i need to deploy the DNS change [21:01:28] and in 5 minutes it should work, heh [21:01:36] oh good [21:01:59] oh crap [21:02:17] that also relies on pulling [21:02:25] 10Operations, 10Performance-Team, 10Traffic: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10Krinkle) [21:02:36] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01016 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:03:04] 10Operations, 10Wikimedia-Incident: September 2019 DoS attacks [Public] - https://phabricator.wikimedia.org/T232224 (10RhinosF1) Any chance of an Incident report? [21:03:13] mutante: in the past bblack has manually pushed DNS changes while gerrit was unreachable [21:03:17] mutante: it's possible but messy AIUI [21:04:18] gerrit-replica is still up FWIW [21:04:27] it could pull from there if that's somehow an easy change to make [21:06:22] Maybe we can get it to clone from it's self thcipriani ? [21:06:26] git clone [21:07:17] is gerrit-replica up to date with the change? [21:07:38] cdanis: ack.. uhm.. 
i am reading https://wikitech.wikimedia.org/wiki/DNS#Update_DNS_if_gerrit_or_DNS_are_down_(on_an_emergency_only) [21:07:50] that's a good point [21:07:59] mutante: we can do something easier [21:08:17] mutante: does gerrit-replica have the right version of the repo? [21:08:22] gerrit-replica should be up-to-date, yes [21:08:26] * paladox checks [21:08:27] I can see the change there [21:08:33] (on disk) [21:08:58] https://gerrit-replica.wikimedia.org/r/plugins/gitiles/operations/puppet/ [21:08:59] yup [21:09:16] paladox: it's the DNS change [21:09:25] https://gerrit-replica.wikimedia.org/r/plugins/gitiles/operations/dns/ [21:09:26] also there [21:09:33] https://gerrit-replica.wikimedia.org/r/plugins/gitiles/operations/dns/ [21:09:34] ok [21:09:37] yup [21:09:39] one minute [21:09:41] cdanis: https://gerrit-replica.wikimedia.org/r/plugins/gitiles/operations/dns/+/6a5b47bd1d6fba36ff97bb4dbbf0c105e53bbe54 [21:09:44] thanks! [21:10:39] paladox: we still need to fix the dependency issue .. no apache site config [21:10:51] oh! [21:11:01] can we patch the puppet repo on the puppetmaster? [21:11:23] (as gross as that request sounds) [21:12:01] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin A:dns-auth 'perl -p -i".bak" -e "s/gerrit\./gerrit-replica./" /etc/wikimedia-authdns.conf' [21:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:20] ok, dns change is merging now [21:12:26] thanks cdanis! [21:13:08] we now need to remove whatever all-avatars is from gerrit::jetty [21:13:16] mutante: done [21:13:24] cdanis: thanks! [21:13:36] and I think puppet will restore /etc/wikimedia-authdns.conf to usual but I will check [21:13:52] 10Operations, 10Performance-Team, 10Traffic: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10ori) Still doesn't work for me. 
[21:14:06] thcipriani +1 [21:14:23] https://wikitech.wikimedia.org/wiki/Puppet#Git_is_down_(and_requires_a_puppet_change_to_put_it_back) :( [21:14:29] it says TODO [21:14:32] yes [21:14:37] it is Hard(tm) apparently [21:14:57] i can copy the apache config files from cobalt [21:15:06] Can we edit puppet, but then change it back and do it through gerrit (once it's back) [21:15:12] mutante oh! [21:15:17] cdanis: haha, is your PS1 telling you it's beer o'clock? [21:15:19] +1 [21:15:27] ori: 😂 yes it is [21:15:43] that's awesome [21:15:52] it's coffee until 3pm, and then 3pm-5pm is tea [21:16:01] mutante: +1 [21:16:29] dns change seems live to me [21:16:34] * paladox gets a different ip [21:16:42] !log previous cumin invocation was to unblock gerrit migration; will be automatically restored to usual on next puppet run. T222391 [21:16:45] cdanis: thank you for the intervention when your PS1 shows beer. [21:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:46] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [21:18:04] !log copied apache config for gerrit.wm.org site from cobalt to gerrit1001, restarted apache2 [21:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:28] thcipriani: done. running puppet again [21:19:09] gerrit's back [21:19:16] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:18] ^ just noticed as well. [21:19:19] wow, man :) [21:19:40] i'm glad we could avoid the puppetmaster hack [21:19:45] and that cdanis had the DNS hack [21:19:46] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:00] puppet works now on gerrit1001 too [21:20:04] just finished [21:20:18] phew. OK. 
I'm going to kick off the indexer for changes [21:20:29] I'll remove All-Avatars from the gerrit module now! [21:20:38] paladox: let's still remove that dependency between avatars and having the site [21:20:45] ack:) thx [21:20:46] yup [21:21:07] (03PS1) 10Paladox: Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 [21:21:30] !log copied apache config for gerrit.wm.org site from cobalt to gerrit1001, restarted apache2, ran puppet again. gerrit back up (T222391) [21:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:45] (03PS2) 10Paladox: Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 [21:21:50] paladox: that is the stalled change right [21:21:57] i mean stalled ticket [21:22:16] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:31] Stalled ticket? [21:23:00] paladox: avatar support [21:23:02] !log ssh -p 29418 gerrit.wikimedia.org -- gerrit index start changes --force [21:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:13] ah, right. [21:23:14] paladox: the change somehow does not explain why there was the dependency on the apache site [21:23:15] yup [21:23:30] I'm not sure why the apache template dependant on it [21:23:36] i carn't find the dependacy [21:23:45] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545066 (owner: 10Paladox) [21:23:58] hashar: Notice: /Stage[main]/Gerrit::Crons/Cron[list_mediawiki_extensions]/ensure: created [21:24:03] didn't it touch /var/www ? 
[21:24:04] great :) [21:24:06] RECOVERY - Check the last execution of git_pull_charts on deploy2001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:24:15] thcipriani yup [21:24:25] had some implicit dependency because of that, I'd assume [21:24:41] hashar ci still works :P [21:24:50] RECOVERY - Check the last execution of git_pull_charts on contint2001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:24:55] thcipriani: paladox: https://en.wikipedia.org/wiki/High_five [21:24:59] https://puppet-compiler.wmflabs.org/compiler1001/318/gerrit1001.wikimedia.org/ [21:25:06] mutante :D [21:25:22] it might be worth it to run a cumin invocation to retry all failed puppet runs [21:25:28] if no one objects I'll kick that off [21:25:30] Do we want to tweak the heap? [21:25:44] cdanis: for the git pulls? sure, please do [21:26:05] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b 30 -p 95 '*' 'run-puppet-agent -q --failed-only' [21:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:12] looks at icinga [21:26:16] so we need the cron to be run manually for https://gerrit.wikimedia.org/mediawiki-extensions.txt [21:26:24] or just rsync the file :) [21:26:40] it's just one server left with alerts. 
deploy1001 [21:26:43] thcipriani mutante https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545066/ [21:27:18] RECOVERY - Check the last execution of git_pull_charts on contint1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:27:47] (03PS1) 10Alex Monk: deployment-prep: Fix deploy-service access [puppet] - 10https://gerrit.wikimedia.org/r/545067 [21:27:53] !log gerrit1001 manually running command from "list_mediawiki_extensions" cron (T222391) [21:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:57] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [21:28:11] hashar: /var/www/mediawiki-extensions.txt: ASCII text [21:28:19] magic [21:28:53] ! [remote rejected] HEAD -> refs/publish/production/deployment-prep-fix-deploy-service (internal server error: Error inserting change/patchset) [21:28:54] error: failed to push some refs to 'ssh://krenair@gerrit.wikimedia.org:29418/operations/puppet.git' [21:29:03] attempted to push PS2 to https://gerrit.wikimedia.org/r/545067 [21:29:19] you likley hit a lock [21:29:33] (03PS2) 10Alex Monk: deployment-prep: Fix deploy-service access [puppet] - 10https://gerrit.wikimedia.org/r/545067 (https://phabricator.wikimedia.org/T236103) [21:29:40] cumin invocation almost done, puppet run alerts should be clearing [21:29:41] (03PS2) 10Jhedden: wikimedia.cloud: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [21:29:57] Caused by: com.google.gerrit.server.update.UpdateException: com.google.gerrit.server.git.LockFailureException: Update aborted with one or more lock failures: PackedBatchRefUpdate[ [21:29:57] CREATE: 0000000000000000000000000000000000000000 8b3b97778a11d30af9d66dd272bf15456828b30e refs/changes/67/545067/2 
(REJECTED_OTHER_REASON: transaction aborted) [21:29:57] UPDATE: 52a268577d564e9cde497b68aa8956de6db3c527 b971b31465157435cc1393785e1128c2e3a233f1 refs/changes/67/545067/meta (LOCK_FAILURE) [21:29:57] ] [21:29:58] cdanis: thanks, there was only 1 left (and the meta one "widespread") [21:30:33] rescheduling service check on deploy1001 icinga alert [21:30:44] RECOVERY - Check the last execution of git_pull_charts on deploy1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:31:18] Krenair: yeah, gerrit has its own transaction system on top of git actions. Sometimes you hit a lock. Does pushing again work? [21:31:30] 19<wikibugs> (03PS2) 10Alex Monk: deployment-prep: Fix deploy-service access [puppet] - 10https://gerrit.wikimedia.org/r/545067 (https://phabricator.wikimedia.org/T236103) [21:31:31] looks like it [21:31:32] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002211 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:31:36] cool [21:31:37] yep [21:31:39] ty [21:31:46] (03CR) 10Jhedden: wikimedia.cloud: add initial zone file (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/544175 (https://phabricator.wikimedia.org/T235846) (owner: 10Arturo Borrero Gonzalez) [21:31:58] RECOVERY - Check systemd state on deploy2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:04] don't think I've heard of this actually affecting users in 7-8 years of gerrit :) [21:32:49] Krenair: we just switched gerrit to a new server [21:33:06] there was the DNS change involved with 5 min TTL [21:33:26] It's happen a few times before [21:33:36] we have a task some where on this [21:34:00] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Gerrit 
Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [21:34:38] it started since the move to notedb on the backed rather than a real database [21:35:08] i think this may be fixed in a newer gerrit version [21:35:22] it's change metadata lock rather than something in the git repos [21:35:53] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10CDanis) something to note/fix for future migrations: one option for... [21:38:26] MiB Mem : 64047.1 total, 5105.1 free, 23874.9 used, 35067.1 buff/cache [21:38:33] regarding the new RAM ^ [21:38:41] we should then raise the jvm heap :) [21:38:45] +1 [21:38:51] and the Edge space [21:39:55] but I guess [21:40:02] let it go as is on the new server for a day or two :] [21:40:12] and https://gerrit.wikimedia.org/mediawiki-extensions.txt works [21:40:18] which is hmm .. old/legacy [21:40:30] but used by the ExtensionDistributor extension [21:43:45] paladox: mutante: thcipriani: I guess congratulations? :_] [21:44:21] "congratulations?" as an interrogative seems about right :) [21:44:42] thanks hashar [21:44:48] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=usedMemory [21:44:54] :D [21:44:57] thanks hashar [21:45:22] mutante: though this graph is half true :-] [21:46:11] https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=activeThreads [21:46:21] lower threads! [21:46:31] hashar: we copied the data and renamed the dir in it.. 
heh, yea [21:46:33] Java memory used: 9,740 Mb / 20,480 Mb [21:46:39] (03CR) 10Bstorm: host monitoring: add optional contact group for mgmt interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [21:47:18] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10Urbanecm) >>! In T233271#5538769, @sbassett wrote: > Re:** CheckUser**, there was a recent security patch (T207094, [[ https://gerrit.wikimedia.org/r/539643 | backport to master ]]) which //did// suffer from... [21:47:39] hmm [21:47:43] Eden is still growing super fast [21:47:48] also kudos and thanks to mutante and paladox. I feel like I played a minor role in this one because you all handled so much. [21:48:00] we're reindexing, Eden is probably flailing a bit over that [21:48:13] thanks to all involved indeed [21:48:15] thanks thcipriani! you definitly helped more then you realised :P [21:48:16] indeed [21:48:18] *to you [21:48:24] and thanks to mutante! [21:48:54] now we get the fun of retuning gerrit to make use of our fancy new hardware [21:49:01] \o/ [21:49:07] but we get way more experience nowadays! 
[21:49:19] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10sbassett) [21:49:28] I am about to head to bed, but do ping on the upgrade task about whatever could use to be watched out tomorrow [21:49:38] that's true: won't be as painful or as frantic as the first tuning go-round [21:49:44] but seems it is mostly a drop-in replacement [21:50:08] * hauskater robs hashar 's matelas [21:51:56] ALSO OCT 22ND IS INTERNATIONAL CAPS LOCK DAY [21:51:59] paladox: thank you too for all the patches and working with upstream [21:52:06] :) [21:52:07] hashar: LOL [21:52:27] unfortunately the official website is down :-\ [21:52:39] heh [21:52:51] https://www.daysoftheyear.com/days/caps-lock-day/ [21:53:04] https://web.archive.org/web/2018*/capslockday.com [21:53:05] hashar: slashdotted / CAPS-dotted [21:53:27] seems like the site is no more maintained :-\ such a pity, it used the typical 1990's design [21:53:29] "The best way to Celebrate Caps Lock Day is quite simple, don’t use caps lock!" [21:53:38] we did it wrong [22:00:20] paladox: thcipriani: ah yea.. we still have this open https://phabricator.wikimedia.org/T233714 [22:00:50] but it is not directly related to migration [22:00:54] oh, i cannot view that :( [22:01:00] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (10colewhite) p:05Triage→03High [22:08:00] (03PS1) 10Jhedden: openstack: patch python-designateclient header values [puppet] - 10https://gerrit.wikimedia.org/r/545072 (https://phabricator.wikimedia.org/T235863) [22:08:53] subbu: gerrit is back, fwiw. i just saw your comment about the deployment [22:10:48] mutante, ok reg gerrit. what comment are you referring to btw? 
[22:11:15] subbu: Bali wikipedia https://phabricator.wikimedia.org/T235837#5593217 [22:11:23] (03PS7) 10Paladox: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 [22:11:36] (03PS8) 10Paladox: Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 [22:11:56] (03Abandoned) 10Paladox: Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [22:12:05] (03Abandoned) 10Paladox: Gerrit: Lower TTL to 300 [dns] - 10https://gerrit.wikimedia.org/r/541393 (owner: 10Paladox) [22:12:05] mutante, ah. ok. [22:13:19] we need to deploy to beta cluster to test some parsoid/js changes before pushing to prod .. looks like bea cluster deploy should be fixed now for services .. but at this point, it feels better to wait till tomorrow morning. [22:13:49] subbu: ack, i saw some of that [22:16:16] (03PS3) 10Paladox: Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) [22:19:54] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:20:08] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:20:44] PROBLEM - Apache HTTP on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:23:54] !log mw1340 - restarting php7.2-fpm, restarting apache2 [22:23:56] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:42] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:24:54] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 80885 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:26:29] (03CR) 10Dzahn: [C: 04-1] "if we are removing it let's remove all of it. the important part is that the apache setup does not rely on it though." 
[puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [22:28:23] (03PS4) 10Paladox: Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) [22:28:31] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [22:29:06] (03CR) 10Dzahn: "looks like NOT setting $avatars_host should have also avoided it" [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [22:30:22] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [22:33:08] (03PS5) 10Paladox: Gerrit: Remove All-Avatars from gerrit's module [puppet] - 10https://gerrit.wikimedia.org/r/545066 (https://phabricator.wikimedia.org/T191183) [22:34:05] paladox: i guess i had a duplicate change and ehmm https://gerrit.wikimedia.org/r/q/topic:%2522gerrit1001%2522+(status:open) :) [22:34:20] both of us did stuff twice, heh [22:34:22] lol [22:34:42] xD [22:34:50] (03PS2) 10Dzahn: gerrit::migration: switch master to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541382 [22:38:23] (03CR) 10Dzahn: [C: 03+2] gerrit::migration: switch master to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541382 (owner: 10Dzahn) [22:44:43] (03PS5) 10Dzahn: mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) [22:45:05] thcipriani around? 
[22:45:12] yes [22:45:30] mutante just made a point that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535966/ is showing open even though he merged it [22:45:31] yea, so that wasnt a duplicate or what [22:46:13] checking [22:46:28] oh look [22:46:31] it's empty after rebase [22:47:41] leads me back to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535966/ [22:47:44] the merged change [22:48:36] paladox: and this one i think is not really related to setting up gerrit1001 or was it https://gerrit.wikimedia.org/r/c/operations/puppet/+/536704 [22:48:49] these 2 are the remaining opens that are open in our topic branch [22:49:01] i was just checking if that was all done [22:49:34] that's not realated, nope [22:49:40] that was for me to scap in labs [22:49:48] changed topic from gerrit1001 to gerrit [22:50:13] it's missing patchset 4 [22:50:57] uhmm. indeed [22:51:01] hmm [22:51:03] on disk [22:52:31] (03PS1) 10Alex Monk: profile::acme_chief::cloud: Require python3-designateclient etc. [puppet] - 10https://gerrit.wikimedia.org/r/545081 (https://phabricator.wikimedia.org/T235252) [22:53:19] thcipriani: on both servers? [22:53:31] mutante: only on gerrit1001 [22:54:05] hrmm.. but we rsynced [22:54:07] https://phabricator.wikimedia.org/P9418 [22:54:23] restricted for me :( [22:56:06] thcipriani: i dont see why but should we rsync another time then? [22:56:23] i mean i logged it too.. odd [22:58:29] I don't think so. packed-refs has definitely changed since the migration [22:59:09] ok [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191021T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:46] mutante: I'm going to revert your rebase. Then I think we should maybe try to re-sync the objects directory from ops/puppet? 
[23:01:06] thcipriani: ok [23:01:07] the commits to the meta branch that show that this patchset merged all seem to exist as loose objects in the repo [23:01:17] ok [23:01:20] * thcipriani does revert [23:01:59] but i need to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/541382 as well [23:02:14] will that make it worse [23:02:26] nope [23:02:31] gitiles shows it as merged [23:03:22] A similar problem affected another puppet patch recently - https://phabricator.wikimedia.org/T234533 [23:03:25] not sure if related [23:03:26] (03PS1) 10Dzahn: Revert "gerrit::migration: switch master to gerrit1001" [puppet] - 10https://gerrit.wikimedia.org/r/545084 [23:04:15] (03CR) 10Dzahn: [C: 03+1] "this only affects the rsync setup, it's not the master switch" [puppet] - 10https://gerrit.wikimedia.org/r/545084 (owner: 10Dzahn) [23:05:21] argg. wait. need to double check this [23:05:54] since we merged the migration::role [23:06:11] paladox: this one is unrelated [23:06:17] ok [23:06:48] (03CR) 10MarcoAurelio: "This is now an empty commit." [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [23:07:07] oh [23:07:32] damn, another one?! [23:07:38] mutante: if you would, go ahead and sync the object directory for ops/puppet. Don't sync with --delete though, and only sync the objects directory. /srv/gerrit/git/operations/puppet.git/objects [23:07:44] paladox: don't touch it [23:07:47] ok [23:09:35] 10Operations, 10Performance-Team, 10Traffic: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (10colewhite) p:05Triage→03Normal [23:10:37] thcipriani: done! 
[23:10:53] luckily i did not have to revert that change as i thought [23:11:42] !log rsynced operations/puppet.git/objects from cobalt to gerrit1001 (and backup in /root) (T222391) [23:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:47] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [23:12:10] my change is back to PS3 [23:12:34] mine still shows empty :( [23:12:40] also needs a reindex [23:12:43] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10colewhite) a:03colewhite [23:12:45] mutante: should be fixed now [23:12:49] (after a reindex) [23:13:15] thcipriani: i dont get why it happened but i'm glad :) [23:13:25] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10colewhite) a:03colewhite [23:13:45] I...don't get why it happened either [23:13:58] for the record I did: [23:14:40] yes, it shows as merged now as it should [23:15:11] !log ops/puppet:sudo -u gerrit2 git update-ref refs/changes/66/535966/meta d6909e0537b7fe7ca6fb8989e22ed5e1c1e46e89 && sudo -u gerrit2 git update-ref refs/changes/66/535966/meta 8494c28eee163aaa48f6274e899e0e79f3a757f3 on gerrit1001 [23:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:32] ok, lemme see if I can do the same for paladox's [23:15:38] i am looking at https://gerrit.wikimedia.org/r/q/topic:%22gerrit1001%22+(status:open%20OR%20status:merged) and in that overview it's still open [23:15:40] thanks thcipriani! [23:16:28] mutante: hrm, still doing a lot of indexing on the server. I hope it'll correct itself after indexing is complete. 
[23:16:48] ack [23:16:50] it's correct on disk now [23:16:52] FWIW [23:17:28] that's what should matter, yea [23:20:10] rsync -avun --delete $TARGET $SOURCE |grep "^deleting " will give you a list of files that do not exist in the target-directory. [23:20:18] if we want to run something like this [23:20:25] note the -n for dry-run [23:27:35] paladox: your patchset is fixed [23:27:43] thcipriani thanks!! [23:31:53] Should the topic be updated? [23:32:29] yes [23:36:02] (03PS1) 10Dzahn: cumin: update which server is the kafka-main canary [puppet] - 10https://gerrit.wikimedia.org/r/545094 [23:51:12] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [23:59:01] (03PS3) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544982 (https://phabricator.wikimedia.org/T231178)
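Note on the rsync dry-run quoted at 23:20:10: rsync takes the source path first and the destination second, so in `rsync -avun --delete $TARGET $SOURCE` the "deleting" lines name entries that exist under `$SOURCE` but not under `$TARGET`, and `-n` keeps it from changing anything. The following is a minimal Python sketch of that same comparison, not something that was run during this session. The function name `files_missing_in_target` is hypothetical, and the sketch makes two simplifying assumptions: it compares regular files by relative path only (not by size, mtime, or content) and it ignores directories, which rsync would also report.

```python
from pathlib import Path
from typing import List, Set


def files_missing_in_target(source: str, target: str) -> List[str]:
    """Return relative paths of regular files under `source` that are absent under `target`.

    Hypothetical helper for illustration: compares by relative path only,
    never by content, and never modifies either directory.
    """

    def relative_files(root: Path) -> Set[str]:
        # Collect every regular file below root, keyed by its path relative to root.
        return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

    # Set difference: present in source, missing in target.
    return sorted(relative_files(Path(source)) - relative_files(Path(target)))
```

Called with the objects directory on the old host's copy as `source` and the new host's copy as `target` (both mounted or copied locally, which is an assumption), the result would list loose objects and packs that never made it across. An empty list only means the file names match, not that the contents do.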