[00:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T0000). [00:40:52] (03PS4) 10Jforrester: [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [00:41:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [01:14:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:14:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:14:45] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:14:49] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:15:27] 
PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:15:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:15:51] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:15:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:16:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:16:36] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10DannyS712) [01:17:01] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:17:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:17:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:17:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:17:55] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:17:59] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:18:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:19:02] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [01:19:02] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:24:55] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [01:53:15] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:04:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:06:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:25:23] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1566659363[6](2019-09-15T13:39:44.466Z), enwiki_content_1546970425[3](2019-09-15T13:39:54.892Z) Mathew.onipe Some nodes are down. I will monitor this to see if its worth raising a task - The acknowledgement expires at: 2019-09-19 18:22:52. https://wikitech.wikimedia.org/wiki/Search%23Administration [02:42:53] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26853320 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:47:39] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 724128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:48:50] (03PS3) 10Mathew.onipe: query_service: change wdqs module to query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [03:34:19] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:11] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:09:13] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10BBlack) We'll also need to normalize the incoming `Accept` headers up in the edge... 
[04:18:45] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:50:03] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) [04:50:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) [04:50:18] 10Operations, 10ops-eqiad: Power issue in eqiad A1 - https://phabricator.wikimedia.org/T233248 (10Marostegui) [04:56:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:57:42] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:08:15] (03PS1) 10Marostegui: mariadb: Decommission db2055 [puppet] - 10https://gerrit.wikimedia.org/r/537786 (https://phabricator.wikimedia.org/T233186) [05:09:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2055 [puppet] - 10https://gerrit.wikimedia.org/r/537786 (https://phabricator.wikimedia.org/T233186) (owner: 10Marostegui) [05:10:27] 10Operations, 10DBA: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Marostegui) [05:11:27] !log Remove db2055 from tendril and zarcillo T233186 [05:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:30] T233186: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 [05:11:54] !log Stop MySQL on db2055 for decommission T233186 [05:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:53] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission 
db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Marostegui) a:05Marostegui→03RobH [05:13:15] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Marostegui) This host is ready for #dc-ops to decommission [05:13:32] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:34:33] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) [05:34:49] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) p:05Triage→03Normal [05:35:17] 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [05:35:20] 10Operations, 10Traffic, 10Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (10Vgutierrez) 05Open→03Resolved [05:39:12] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (10Vgutierrez) [06:02:02] (03PS1) 10Vgutierrez: acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) [06:02:15] ^^ getting rid of a lot of hacks <3 [06:04:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Temporarily pool db1089 into enwiki logpager T223151', diff saved to https://phabricator.wikimedia.org/P9130 and previous config saved to /var/cache/conftool/dbconfig/20190919-060440-marostegui.json [06:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:44] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [06:18:44] !log Sanitize hiwikisource on 
db1124:3313 and db2094:3313 T219374 [06:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:47] T219374: Prepare and check storage layer for hi.wikisource - https://phabricator.wikimedia.org/T219374 [06:22:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10wiki_willy) a:03Jclark-ctr [06:24:58] (03CR) 10Elukey: Adding config for friendly values on netflow dataset (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) (owner: 10Nuria) [06:28:10] (03PS2) 10Vgutierrez: acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) [06:29:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10wiki_willy) [06:30:12] (03PS3) 10Elukey: turnilo: add config for friendly values on netflow dataset [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) (owner: 10Nuria) [06:31:08] (03CR) 10Vgutierrez: [C: 03+1] "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18420/" [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [06:31:27] (03CR) 10Elukey: [C: 03+2] turnilo: add config for friendly values on netflow dataset [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) (owner: 10Nuria) [06:46:12] (03PS1) 10Marostegui: Revert "wmnet: Point m1-master to dbproxy1014" [dns] - 10https://gerrit.wikimedia.org/r/537790 [06:46:42] (03PS2) 10Marostegui: Revert "wmnet: Point m1-master to dbproxy1014" [dns] - 10https://gerrit.wikimedia.org/r/537790 [06:47:17] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Point m1-master to dbproxy1014" [dns] - 10https://gerrit.wikimedia.org/r/537790 (owner: 10Marostegui) 
[06:53:57] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:01:39] !log reimaging restbase2012 to stretch T224553 [07:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:42] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [07:02:47] (03PS2) 10Muehlenhoff: restbase2012: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536567 [07:11:48] (03CR) 10Muehlenhoff: [C: 03+2] restbase2012: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536567 (owner: 10Muehlenhoff) [07:17:51] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [07:19:47] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 44 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [07:21:17] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:22:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more logpager weight to db1089 T223151', diff saved to https://phabricator.wikimedia.org/P9131 and previous config saved to /var/cache/conftool/dbconfig/20190919-072234-marostegui.json [07:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:39] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [07:27:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:27:35] (03PS3) 10Vgutierrez: acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) [07:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:27] 10Operations: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10MoritzMuehlenhoff) [07:29:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:37] (03CR) 10jerkins-bot: [V: 04-1] acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [07:29:59] (03PS1) 10Kosta Harlan: (wip) GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) [07:30:00] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [07:32:56] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 
07:45:52 +0000 (expires in 44 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [07:35:14] (03PS4) 10Vgutierrez: acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) [07:37:15] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10MoritzMuehlenhoff) [07:37:31] (03CR) 10Muehlenhoff: "I created https://phabricator.wikimedia.org/T233281 for that" [puppet] - 10https://gerrit.wikimedia.org/r/537132 (owner: 10Muehlenhoff) [07:38:05] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable Special:Homepage for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537803 (https://phabricator.wikimedia.org/T233066) [07:39:22] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Marostegui) This is what m5 has at the moment: ` +------------------------+ | Database | +------------------------+ | designate | | designate_pool_m... [07:39:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:39:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:32] !log rebooting failoid1001 for kernel update [07:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:29] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10jcrespo) 05Open→03Resolved I can see now 24, thanks! 
megacli -PDList -aALL | grep 'Device Id' | wc -l 24 [07:41:33] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) [07:41:50] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10raja_wmde) @Franziska_Heine is my manager [07:44:28] RECOVERY - Check systemd state on eventlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:22] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] Fix incorrect channel name for TranslationNotifications extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [07:48:11] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:48:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable help panel for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537804 (https://phabricator.wikimedia.org/T233065) [07:50:11] (03CR) 10Vgutierrez: [C: 03+2] acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [07:50:21] (03PS5) 10Vgutierrez: acme_chief,ATS,tlsproxy: Move to acme-chief centrally managed OCSP responses [puppet] - 10https://gerrit.wikimedia.org/r/537789 (https://phabricator.wikimedia.org/T232988) [07:50:25] (03Abandoned) 10Kosta Harlan: WIP: Enable GrowthExperiments for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534789 (https://phabricator.wikimedia.org/T232060) (owner: 10Kosta Harlan) [07:54:49] 
PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:57:51] (03PS2) 10Elukey: Remove cloudvirtan100X references [dns] - 10https://gerrit.wikimedia.org/r/537651 (https://phabricator.wikimedia.org/T225128) [07:58:20] (03CR) 10Elukey: [C: 03+2] Remove cloudvirtan100X references [dns] - 10https://gerrit.wikimedia.org/r/537651 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [08:06:48] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Krenair) It's the labspuppet database, yes. Note that the toolforge project has its own puppetmasters and only they were talking to the central puppetmaster. [08:07:11] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:07:45] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Marostegui) What we normally do is, rename all the tables on the given database and leave it like that for a few days to check that nothing is really using it, and then drop it. [08:08:57] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Krenair) Sounds like a good idea, let's do it. 
[08:10:03] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:11:30] !log Rename tables on db1133:labspuppet T233281 [08:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:34] T233281: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 [08:11:48] jouncebot: now [08:11:49] No deployments scheduled for the next 2 hour(s) and 48 minute(s) [08:11:52] jouncebot: next [08:11:53] In 2 hour(s) and 48 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1100) [08:11:59] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [08:12:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) ` elukey@asw2-a-eqiad# show | compare [edit interfaces xe-2/0/24] - description cloudvirtan1002; +... [08:12:58] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Marostegui) a:03Marostegui Done: ` root@db1133.eqiad.wmnet[labspuppet]> show tables; +-------------------------+ | Tables_in_labspuppet | +-------------------------+ | TO... 
[08:13:10] (03CR) 10Urbanecm: [C: 03+2] Change configuration of AbuseFilter extension for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533747 (https://phabricator.wikimedia.org/T231750) (owner: 10Zoranzoki21) [08:14:12] (03Merged) 10jenkins-bot: Change configuration of AbuseFilter extension for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533747 (https://phabricator.wikimedia.org/T231750) (owner: 10Zoranzoki21) [08:14:33] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/CheckUser/: security T207094 (duration: 01m 06s) [08:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:37] (03CR) 10jenkins-bot: Change configuration of AbuseFilter extension for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533747 (https://phabricator.wikimedia.org/T231750) (owner: 10Zoranzoki21) [08:15:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [08:15:47] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:17:37] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable Special:Homepage for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537803 (https://phabricator.wikimedia.org/T233066) (owner: 10Kosta Harlan) [08:18:14] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) ` elukey@asw2-b-eqiad# show | compare [edit interfaces xe-4/0/5] - description cloudvirtan1004; +... 
[08:19:26] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) (owner: 10Filippo Giunchedi) [08:20:12] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:55] (03PS1) 10Vgutierrez: acme_chief: Remove update-ocsp.d leftovers [puppet] - 10https://gerrit.wikimedia.org/r/537923 (https://phabricator.wikimedia.org/T232988) [08:21:57] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/CheckUser/: revert T207094 (duration: 01m 04s) [08:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:09] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:10] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I'm a bit surprised we haven't done this before. Go for it!" [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [08:24:31] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10Franziska_Heine) Approved! 
Franziska [08:26:18] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable help panel for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537804 (https://phabricator.wikimedia.org/T233065) (owner: 10Kosta Harlan) [08:31:20] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: 393441b: Change configuration of AbuseFilter extension for enwikisource (T231750) (duration: 01m 04s) [08:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:23] T231750: (enwikisource) Abuse filter changes request: add abusefilter actions block + autoconfirmed to see abusefilter-log-detail - https://phabricator.wikimedia.org/T231750 [08:33:56] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) @Cmjohnson @Jclark-ctr there is one last problem - an-presto1005: 1) is not connected to any switch... [08:35:41] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) [08:36:45] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) We may need some firmware updates, but hw is ready to go as soon as background raid initialization finishes on array2 of backup1001. Hosts installed with buster.... 
[08:36:47] (03CR) 10Gehel: [C: 03+2] wdqs: switch production clusters to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537634 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:36:51] (03PS3) 10Gehel: wdqs: switch production clusters to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537634 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:44:22] (03PS1) 10Vgutierrez: ATS,tlsproxy: ocsp parameter for acme_chief::cert is not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/537927 (https://phabricator.wikimedia.org/T232988) [08:44:37] (03PS1) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [08:46:34] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:46:47] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MoritzMuehlenhoff) 05Resolved→03Open @MGerlach : You're using the same key for production SSH access and Cloud VPS, which is insecure as Cloud VPS all... 
[08:47:40] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:59:58] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:01:56] 10Operations, 10observability: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10fgiunchedi) [observability hat on] I like the idea of being able to capture outstanding apache requests, possibly on demand / during incidents if we so desire. If we're either sampling `mod_status` for l... [09:02:28] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:09:01] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Remove update-ocsp.d leftovers [puppet] - 10https://gerrit.wikimedia.org/r/537923 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [09:09:10] (03PS2) 10Vgutierrez: acme_chief: Remove update-ocsp.d leftovers [puppet] - 10https://gerrit.wikimedia.org/r/537923 (https://phabricator.wikimedia.org/T232988) [09:10:11] (03CR) 10Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [09:10:16] (03PS2) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [09:10:18] (03PS1) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [09:10:41] 10Operations, 10observability: Apache mod_status 
aggregator - https://phabricator.wikimedia.org/T233047 (10Joe) One important detail: php-fpm has a slow log function we're using even right now (but still not collecting to logstash or doing anything with) that will not only register the url that's being slowly... [09:10:56] (03CR) 10Jcrespo: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:11:19] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:12:04] * Urbanecm stagging on mwdebug1002 [09:12:45] (03PS2) 10Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) [09:13:12] (03CR) 10Vgutierrez: [C: 03+2] ATS,tlsproxy: ocsp parameter for acme_chief::cert is not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/537927 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [09:13:20] (03PS2) 10Vgutierrez: ATS,tlsproxy: ocsp parameter for acme_chief::cert is not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/537927 (https://phabricator.wikimedia.org/T232988) [09:15:57] (03CR) 10Jcrespo: "@akosiaris Thanks for the clean existing puppet code, that will make migration much easier. 
I am thinking of applying the puppet code firs" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:17:23] (03PS2) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [09:19:13] 10Operations, 10DBA: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10jcrespo) Reminder: Let's check grants too. [09:20:27] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) Sorry, no. We're not going to intentionally downgrade to an old, known-insecure version. If there's a... [09:22:38] !log power back on ms-be1027, found with power off [09:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> @akosiaris Thanks for the clean existing puppet code, that will make migration much easier. I am thinking of applying the puppet code fi" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:25:10] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Open→03Resolved @MoritzMuehlenhoff Added separate key for Cloud VPS. 
[09:29:50] PROBLEM - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:30:04] PROBLEM - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:30:06] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:30:26] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:30:28] PROBLEM - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:30:40] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:30:46] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:30:58] PROBLEM - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:31:00] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:31:01] silencing [09:31:10] reimage alert spam [09:32:00] 10Operations, 10ops-eqiad: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10fgiunchedi) [09:32:13] (03CR) 10Jcrespo: "> That might not be so easy as it sounds, 
as we have the archive pool as well to migrate. That is, the database does have data we don't wa" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:32:28] ACKNOWLEDGEMENT - Host ms-be1027 is DOWN: PING CRITICAL - Packet loss = 100% Filippo Giunchedi https://phabricator.wikimedia.org/T233289 [09:32:38] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T93886 [09:32:38] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T120662 [09:32:38] ACKNOWLEDGEMENT - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive Muehlenhoff bootstrap pending after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:32:38] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T93886 [09:32:38] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T120662 [09:32:39] ACKNOWLEDGEMENT - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Muehlenhoff bootstrap pending after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:32:39] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused 
Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T93886 [09:32:40] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff bootstrap pending after reimage https://phabricator.wikimedia.org/T120662 [09:32:40] ACKNOWLEDGEMENT - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive Muehlenhoff bootstrap pending after reimage https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:38:15] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, 10Kubernetes: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) [09:44:04] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:45:28] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 2 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10MarcoAurelio) [09:47:07] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:50:22] jouncebot: now [09:50:22] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [09:50:53] (03CR) 10Marostegui: "I am fine with the new database creation, please make sure to include the grants on the production-m1.sql grants file, so we can also clea" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:51:07] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) 05Open→03Declined It is 
not about downgrading Docker, but rather to keep the same version we are currently u... [09:51:10] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [09:51:52] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/CheckUser: security T207094 (duration: 01m 05s) [09:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, I 'll merge once @mobrovac or @ppchelko give a LGTM as well" [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [09:53:14] !log urbanecm@deploy1001 sync-file aborted: security T207094 (duration: 00m 28s) [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/CheckUser: security T207094 (duration: 01m 02s) [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:33] (03PS1) 10Gilles: Lower gzip threshold for SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) [10:03:50] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 2 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10hashar) @Jdforrester-WMF has added an experimental `helm-lint` job to the repository: T216049. It runs `helm lint --strict charts/*/` :] [10:04:45] (03PS2) 10Gilles: Lower gzip threshold for SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) [10:07:06] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) >>! 
In T226236#5505608, @hashar wrote: > Anywa,y I am declining this and postpone the migration to St... [10:12:28] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:15:30] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:17:04] (03PS9) 10Vgutierrez: ATS: Provide websocket support [puppet] - 10https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594) [10:17:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:55] (03PS4) 10Volans: sre.hosts.decommission: enhance capabilities [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) [10:25:31] (03CR) 10Volans: "> Patch Set 3: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [10:27:34] (03CR) 10Vgutierrez: [C: 03+2] ATS: Provide websocket support [puppet] - 10https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:29:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [10:34:59] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 2 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) >>! In T233291#5505631, @hashar wrote: > @Jdforrester-WMF has added an experimental `helm-lint` job to the repository: T216049. It... 
[10:35:20] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 2 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) p:05Triage→03High a:03Joe [10:36:01] (03PS1) 10Urbanecm: Add new throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537983 (https://phabricator.wikimedia.org/T233199) [10:36:18] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 2 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) [10:37:48] (03PS1) 10Giuseppe Lavagetto: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) [10:39:00] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:30] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (10Lucas_Werkmeister_WMDE) Real code: [`MIMEParse.java`](https://github.com/LinkedDat... 
[10:40:52] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:43:54] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:45:18] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add known websocket endpoints to the TLS instance mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [10:45:27] (03PS3) 10Vgutierrez: ATS: Add known websocket endpoints to the TLS instance mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) [10:47:58] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:53:28] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:53:43] I guess downtime expired [10:54:56] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:55:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor comment inline, but otherwise probably ok for starters" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1100). [11:00:04] kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] \o [11:00:15] I can SWAT today! [11:00:21] Hi Urbanecm, thanks [11:00:48] kostajh: the calendar contains one patch twice [11:00:53] is that intentional? [11:01:00] mm, that's not very helpful. One sec [11:01:30] thanks [11:01:30] There we go [11:01:42] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable Special:Homepage for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537803 (https://phabricator.wikimedia.org/T233066) (owner: 10Kosta Harlan) [11:02:28] thanks again [11:02:54] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable help panel for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537804 (https://phabricator.wikimedia.org/T233065) (owner: 10Kosta Harlan) [11:04:27] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537983 (https://phabricator.wikimedia.org/T233199) (owner: 10Urbanecm) [11:04:28] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:05:58] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:07:50] (03Merged) 10jenkins-bot: GrowthExperiments: Enable Special:Homepage for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537803 (https://phabricator.wikimedia.org/T233066) (owner: 10Kosta Harlan) [11:08:09] (03CR) 10jenkins-bot: GrowthExperiments: Enable Special:Homepage for euwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/537803 (https://phabricator.wikimedia.org/T233066) (owner: 10Kosta Harlan) [11:08:22] (03Merged) 10jenkins-bot: GrowthExperiments: Enable help panel for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537804 (https://phabricator.wikimedia.org/T233065) (owner: 10Kosta Harlan) [11:08:53] (03Merged) 10jenkins-bot: Add new throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537983 (https://phabricator.wikimedia.org/T233199) (owner: 10Urbanecm) [11:09:10] kostajh: both are on mwdebug1002 [11:09:17] Urbanecm: thanks, looking [11:10:14] (03CR) 10jenkins-bot: GrowthExperiments: Enable help panel for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537804 (https://phabricator.wikimedia.org/T233065) (owner: 10Kosta Harlan) [11:13:00] (03PS1) 10Vgutierrez: cache: Deploy ats-tls in the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/537993 (https://phabricator.wikimedia.org/T231627) [11:20:04] Urbanecm: looks good to me [11:21:19] kostajh: ok, thanks, syncing [11:22:12] (03PS2) 10Vgutierrez: ATS: Unmask trafficserver.service iff it's actually being used [puppet] - 10https://gerrit.wikimedia.org/r/537611 [11:22:14] (03PS2) 10Vgutierrez: cache: Deploy ats-tls in the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/537993 (https://phabricator.wikimedia.org/T231627) [11:23:15] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: eab7c6a: c80f026: GrowthExperiments: GrowthExperiments: Enable Special:Homepage for euwiki, GrowthExperiments: Enable help panel for euwiki (T233066, T233065) (duration: 01m 05s) [11:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:21] T233065: Deploy Help Panel to Basque Wikipedia - https://phabricator.wikimedia.org/T233065 [11:23:21] T233066: Deploy Newcomer Homepage to Basque Wikipedia - https://phabricator.wikimedia.org/T233066 [11:23:29] kostajh: done! 
[11:23:56] Urbanecm: thx [11:24:02] happy to help [11:26:05] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 199a05c: Add new throttle rule for Czech wiki course (T233199) (duration: 01m 01s) [11:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:09] T233199: Request temporary lift of IP cap for Czech course - https://phabricator.wikimedia.org/T233199 [11:29:42] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:30:51] (03PS3) 10Vgutierrez: ATS: Unmask trafficserver.service iff it's actually being used [puppet] - 10https://gerrit.wikimedia.org/r/537611 [11:30:53] (03PS3) 10Vgutierrez: cache: Deploy ats-tls in the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/537993 (https://phabricator.wikimedia.org/T231627) [11:32:00] !log EU SWAT done [11:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:42:20] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:42:34] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1001/18428/" [puppet] - 10https://gerrit.wikimedia.org/r/537993 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [11:46:58] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:47:40] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:48:29] !log bootstrap restbase2012-a -- T224553 [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [11:49:06] RECOVERY - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-a valid until 2020-06-24 13:02:01 +0000 (expires in 279 days) https://phabricator.wikimedia.org/T120662 [11:56:20] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5007 [puppet] - 10https://gerrit.wikimedia.org/r/537994 (https://phabricator.wikimedia.org/T231627) [11:56:22] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp5007 [puppet] - 10https://gerrit.wikimedia.org/r/537995 (https://phabricator.wikimedia.org/T231627) [12:00:08] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) restbase2012 has been reimaged and is ready to be bootstrapped in Cassandra. All jessie Cassandra instances gone! [12:08:37] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Cmjohnson) This server is out of warranty and @robh has created a procurement task. [12:11:47] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) >>! In T233273#5505877, @Cmjohnson wrote: > This server is out of warranty and @robh has created a procurement task. Indeed Chris - thanks! I caught up with Willy earlier today and... 
[12:12:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson) [12:12:32] 10Operations, 10ops-eqiad, 10netops: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10Cmjohnson) [12:16:48] 10Operations, 10ops-eqiad: Verify switch port connections - https://phabricator.wikimedia.org/T233302 (10Cmjohnson) [12:22:21] (03PS1) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/538003 (https://phabricator.wikimedia.org/T230783) [12:22:27] (03PS1) 10Marostegui: wmnet: Update s3-master alias to point to db1078 [dns] - 10https://gerrit.wikimedia.org/r/538004 (https://phabricator.wikimedia.org/T230783) [12:22:55] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/538003 (https://phabricator.wikimedia.org/T230783) (owner: 10Marostegui) [12:23:16] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/538004 (https://phabricator.wikimedia.org/T230783) (owner: 10Marostegui) [12:36:14] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:44] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7f4b7f7]: Start using RESTBase built on Stretch - T224553 [12:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:47] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [12:40:34] PROBLEM - Host mw1300 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:02] looking into mw1300 [12:53:36] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [12:54:48] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [12:55:04] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:57:52] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [12:59:03] just wondering if something is wrong with https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/NRdk and other rename (which is over 100K) [12:59:05] PROBLEM - Host ms-be1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:00:47] oh.... 
it was few days ago [13:01:22] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7f4b7f7]: Start using RESTBase built on Stretch - T224553 (duration: 21m 38s) [13:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:25] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [13:05:48] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:07:16] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:07:45] (03Abandoned) 10Filippo Giunchedi: ci: add statsd_exporter for zuul/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) (owner: 10Filippo Giunchedi) [13:08:41] (03PS1) 10Jbond: profile::icinga: update to use lookup instead of hiera [puppet] - 10https://gerrit.wikimedia.org/r/538018 [13:08:44] RECOVERY - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.68 port 9042 https://phabricator.wikimedia.org/T93886 [13:08:45] (03PS1) 10Jbond: profile::icinga: Add apereo_cas authenticated vhost [puppet] - 10https://gerrit.wikimedia.org/r/538019 [13:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1089 into contributions service T223151', diff saved to https://phabricator.wikimedia.org/P9133 and previous config saved to /var/cache/conftool/dbconfig/20190919-130848-marostegui.json [13:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:52] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [13:12:48] !log bootstrap restbase2012-b -- T224553 [13:12:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [13:13:40] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2020-06-24 13:02:02 +0000 (expires in 278 days) https://phabricator.wikimedia.org/T120662 [13:13:48] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:14:57] !log powercycling mw1300 [13:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] RECOVERY - Host mw1300 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [13:19:31] jouncebot: now [13:19:31] No deployments scheduled for the next 2 hour(s) and 40 minute(s) [13:19:32] jouncebot: next [13:19:33] In 2 hour(s) and 40 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1600) [13:20:26] Urbanecm: btw https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/NRdk is stuck for 2 days, FYI [13:20:46] just letting you know since you've been doing renames [13:20:54] revi: thanks [13:21:14] and better positioned to poke the ops than me (who is going to sleep in few hours) [13:21:59] revi: I'll create a task soon :) [13:22:18] RECOVERY - Host ms-be1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 346.64 ms [13:22:26] there's no sign why mw1300 went down in SEL or syslog/kern.log, but after a power cycle it's back up fine again [13:27:04] trying to move a page on mw.o, i'm told that i have exceeded the throttle limit for page moves [13:27:11] i havn't moved a page in ages [13:27:15] what gives? 
[13:28:23] (03CR) 10jenkins-bot: Add new throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537983 (https://phabricator.wikimedia.org/T233199) (owner: 10Urbanecm) [13:41:18] PROBLEM - Host ms-be1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:43:55] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/Translate: T233308 (duration: 01m 07s) [13:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:59] T233308: Special:MovePage WMFTimeoutException from line 39 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded - https://phabricator.wikimedia.org/T233308 [13:46:57] RECOVERY - Host ms-be1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [13:51:28] !log mholloway-shell@deploy1001 Started deploy [recommendation-api/deploy@c8abb0f]: Article recommendation API: replace WDQS with MW API (T216750) [13:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:31] T216750: Article recommendation API: replace WDQS with MW API - https://phabricator.wikimedia.org/T216750 [13:51:52] (03CR) 10Volans: [C: 03+1] "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [13:53:44] Hey all - would like to security-deploy the patch for T224203 (AbuseFilter) in a few minutes. Let me know if I shouldn't. 
[13:54:09] (03PS10) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [13:54:11] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode during designate upgrade [puppet] - 10https://gerrit.wikimedia.org/r/538027 (https://phabricator.wikimedia.org/T212302) [13:54:14] (03PS1) 10Andrew Bogott: Designate: move to OpenStack version 'newton' [puppet] - 10https://gerrit.wikimedia.org/r/538028 (https://phabricator.wikimedia.org/T212302) [13:54:16] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode during designate upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/538029 [13:54:34] !log mholloway-shell@deploy1001 Finished deploy [recommendation-api/deploy@c8abb0f]: Article recommendation API: replace WDQS with MW API (T216750) (duration: 03m 06s) [13:54:40] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:01] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10Aklapper) @DannyS712: When doing what exactly? Viewing? Editing? Moving? Something else? 
[13:56:49] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode during designate upgrade [puppet] - 10https://gerrit.wikimedia.org/r/538027 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [13:56:51] (03CR) 10Volans: [C: 03+2] "Merging to test it with some host to decommission" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [13:57:04] (03CR) 10Andrew Bogott: [C: 03+2] Designate: move to OpenStack version 'newton' [puppet] - 10https://gerrit.wikimedia.org/r/538028 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [13:57:44] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:58:29] (03Merged) 10jenkins-bot: sre.hosts.decommission: enhance capabilities [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [14:00:14] (03PS1) 10Marostegui: index-conf.yaml: Remove unused index [puppet] - 10https://gerrit.wikimedia.org/r/538030 (https://phabricator.wikimedia.org/T233135) [14:01:30] (03PS2) 10Andrew Bogott: Revert "Horizon: put into maintenance mode during designate upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/538029 [14:01:32] (03PS11) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [14:01:34] (03PS1) 10Andrew Bogott: Designate: upgrade eqiad1 to Newton [puppet] - 10https://gerrit.wikimedia.org/r/538031 (https://phabricator.wikimedia.org/T212302) [14:02:52] (03CR) 10Andrew Bogott: [C: 03+2] Designate: upgrade eqiad1 to Newton [puppet] - 10https://gerrit.wikimedia.org/r/538031 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [14:05:40] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:07:10] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:16:59] !log mobrovac@deploy1001 Started deploy [restbase/deploy@44f4c79]: Remove the TID suffix in the ETag, if present - T230272 [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] T230272: 404 error when using VisualEditor: apierror-visualeditor-docserver-http - https://phabricator.wikimedia.org/T230272 [14:17:08] (03CR) 10Jhedden: [C: 03+1] index-conf.yaml: Remove unused index [puppet] - 10https://gerrit.wikimedia.org/r/538030 (https://phabricator.wikimedia.org/T233135) (owner: 10Marostegui) [14:17:10] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode during designate upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/538029 (owner: 10Andrew Bogott) [14:17:49] (03PS2) 10Marostegui: index-conf.yaml: Remove unused index [puppet] - 10https://gerrit.wikimedia.org/r/538030 (https://phabricator.wikimedia.org/T233135) [14:18:27] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:00] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [14:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:50] !log Deployed security patch for T224203 [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] (03CR) 10Marostegui: [C: 03+2] index-conf.yaml: Remove unused index [puppet] - 10https://gerrit.wikimedia.org/r/538030 (https://phabricator.wikimedia.org/T233135) (owner: 10Marostegui) [14:23:02] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL 
connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:24:01] (03PS1) 10Jbond: apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 [14:24:27] I'm running some RIPE Atlas tests on the blog just to check if it is a widespread issue or local; I doubt we can do anything about it, so I might just silence it for 24h [14:26:00] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [14:26:16] 10Operations, 10Thumbor, 10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10jijiki) 05Stalled→03Resolved We are planning to move Thumbor to k8s, T233196, thus I am closing this task [14:27:11] (03PS14) 10Filippo Giunchedi: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [14:27:40] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 43 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:28:03] oh yeah totally shows up in atlas too [14:28:09] !log Deployed security patch for T224203 (php-1.34.0-wmf.23) [14:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:11] https://atlas.ripe.net/measurements/22878620/#!probes [14:28:19] !log mobrovac@deploy1001 deploy aborted: Remove the TID suffix in the ETag, if present - T230272 (duration: 11m 20s) [14:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:22] T230272: 404 error when using VisualEditor: apierror-visualeditor-docserver-http - https://phabricator.wikimedia.org/T230272 [14:28:29] (03PS2) 10Jbond: apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 [14:28:30] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [14:28:34] !log mobrovac@deploy1001 Started deploy [restbase/deploy@44f4c79]: Remove the TID suffix in the ETag, if present, take #2 [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:02] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:30:29] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [14:30:56] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 https://phabricator.wikimedia.org/T93886 [14:30:58] RECOVERY - cassandra-c service on restbase2012 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:31:02] !log bootstrap restbase2012-c -- T224553 [14:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:15] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [14:31:28] (03PS3) 10Jbond: apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 [14:31:44] RECOVERY - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-c valid until 2020-06-24 13:02:03 +0000 (expires in 278 days) https://phabricator.wikimedia.org/T120662 [14:32:34] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:32:36] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (re [14:32:36] mage data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:32:36] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:32:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:34:06] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:34:06] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:34:10] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps 
[14:34:37] effie: thanks for taking care of the blog alert! [14:34:57] (03CR) 10Ppchelko: [C: 03+1] RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:35:06] I'm assuming the mobileapps alerts were related to cassandra bootstraps? mobrovac ? [14:35:26] godog: nope, related to mobileapps flapping unfortunately [14:35:37] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:35:59] mobrovac: ah! thanks, known and/or sth actionable ? [14:36:43] godog: known, not actionable for the moment, we are formulating the strategy with mcs folks (trying to find a work-around for it) [14:36:50] well, "work-around" :P [14:36:58] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@44f4c79]: Remove the TID suffix in the ETag, if present, take #2 (duration: 08m 24s) [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@44f4c79]: Remove the TID suffix in the ETag, if present, take #3 [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:30] mobrovac: lol [14:44:28] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [14:45:04] (03PS1) 10Reedy: Purge CheckUser logs every day rather than weekly [puppet] - 10https://gerrit.wikimedia.org/r/538037 [14:45:08] jouncebot: now [14:45:08] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [14:45:10] jouncebot: net [14:45:11] jouncebot: next [14:45:11] In 1 hour(s) and 14 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1600) [14:46:21] (03CR) 10SBassett: [C: 03+1] "To better mitigate T216794" [puppet] - 10https://gerrit.wikimedia.org/r/538037 (owner: 10Reedy) [14:46:43] 
10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) [14:47:45] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@44f4c79]: Remove the TID suffix in the ETag, if present, take #3 (duration: 10m 42s) [14:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:52] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10MoritzMuehlenhoff) [14:47:54] (03CR) 10CRusnov: [C: 03+2] ganeti: Add ability to get ganeti cluster for given instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [14:48:41] (03PS2) 10Alexandros Kosiaris: RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:48:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:48:59] (03CR) 10jenkins-bot: ganeti: Add ability to get ganeti cluster for given instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [14:51:07] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286) [14:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [14:51:38] 10Operations, 10netops: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10Papaul) [14:56:44] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286) (duration: 05m 39s) [14:56:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:47] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [14:57:15] (03PS1) 10Muehlenhoff: Decom lithium [puppet] - 10https://gerrit.wikimedia.org/r/538040 (https://phabricator.wikimedia.org/T229557) [14:58:33] (03CR) 10Muehlenhoff: [C: 03+2] Decom lithium [puppet] - 10https://gerrit.wikimedia.org/r/538040 (https://phabricator.wikimedia.org/T229557) (owner: 10Muehlenhoff) [15:00:19] (03PS3) 10Alexandros Kosiaris: RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [15:00:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [15:01:33] (03PS1) 10Muehlenhoff: Remove DNS entries for lithium [dns] - 10https://gerrit.wikimedia.org/r/538041 (https://phabricator.wikimedia.org/T229557) [15:02:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for lithium [dns] - 10https://gerrit.wikimedia.org/r/538041 (https://phabricator.wikimedia.org/T229557) (owner: 10Muehlenhoff) [15:03:44] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10MoritzMuehlenhoff) a:05RobH→03MoritzMuehlenhoff Claiming the task for another test of the updated decom cookbook. 
[15:05:19] (03PS1) 10Jcrespo: bacula: Make bacula db parameters configurable on hiera [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) [15:05:43] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [15:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:09] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:15] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `iron.wikimedia.org` - iron.wikimedia.org (**PASS**) - Downtimed host on Icinga - Downtimed... [15:14:47] 10Operations, 10netops, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10ayounsi) p:05Triage→03Normal [15:15:09] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:21:28] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10mobrovac) 05Open→03Resolved a:03mobrovac This has now been completed. Thank you @MoritzMuehlenhoff for assisting and promptly re-imag... 
[15:21:30] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10mobrovac) [15:21:44] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10mobrovac) [15:23:09] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10MoritzMuehlenhoff) I ran another test with iron (T220505) and it worked fine as well: Puppetdb/Debmonitor entries were removed, the Puppet cert revoked, Netbox was correctly updated to "Decomiss... [15:25:19] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=restbase,service=restbase,dc=codfw,name=restbase2012.codfw.wmnet [15:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:23] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=restbase,service=restbase-ssl,dc=codfw,name=restbase2012.codfw.wmnet [15:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:27] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=restbase,service=restbase-backend,dc=codfw,name=restbase2012.codfw.wmnet [15:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:30] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=restbase,service=cassandra,dc=codfw,name=restbase2012.codfw.wmnet [15:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:02] !log repooling restbase2012 after completed Cassandra bootstrap T224553 [15:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:04] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [15:27:26] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - 
https://phabricator.wikimedia.org/T220505 (10MoritzMuehlenhoff) [15:27:47] (03CR) 10Jcrespo: "Will run puppet compiler tomorrow to verify its correctness." [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) (owner: 10Jcrespo) [15:29:50] (03PS1) 10Elukey: Add bacula backups for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) [15:30:19] (03PS1) 10Muehlenhoff: Remove remaining Puppet references for iron [puppet] - 10https://gerrit.wikimedia.org/r/538046 (https://phabricator.wikimedia.org/T220505) [15:31:27] (03PS1) 10Volans: sre.hosts.decomission: improve logging in the console [cookbooks] - 10https://gerrit.wikimedia.org/r/538047 [15:33:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove remaining Puppet references for iron [puppet] - 10https://gerrit.wikimedia.org/r/538046 (https://phabricator.wikimedia.org/T220505) (owner: 10Muehlenhoff) [15:35:36] (03CR) 10Elukey: "Jaime: started the change as we discussed, even if it is probably completely wrong :)" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [15:35:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:35:55] (03PS1) 10Volans: Add Papaul to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/538048 (https://phabricator.wikimedia.org/T233189) [15:36:12] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Ops Group for papaul@ - https://phabricator.wikimedia.org/T233189 (10Volans) p:05Triage→03Normal a:05Volans→03faidon Patch ready, pending approval. [15:36:27] moritzm: is that you for the unmerged puppet changes? 
[15:36:39] (03PS1) 10Muehlenhoff: Remove DNS entry for iron [dns] - 10https://gerrit.wikimedia.org/r/538049 (https://phabricator.wikimedia.org/T220505) [15:36:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Make bacula db parameters configurable on hiera [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) (owner: 10Jcrespo) [15:37:20] no, I had puppet-merged, maybe odd race [15:37:32] re-rerunning shows "no changes to merge" [15:39:10] RECOVERY - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.70 port 9042 https://phabricator.wikimedia.org/T93886 [15:40:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add bacula backups for Matomo and Analytics meta (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [15:40:20] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entry for iron [dns] - 10https://gerrit.wikimedia.org/r/538049 (https://phabricator.wikimedia.org/T220505) (owner: 10Muehlenhoff) [15:41:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission, 10Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (10MoritzMuehlenhoff) [15:41:25] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission, 10Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH Back to Rob for switch port removal [15:43:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10RobH) a:05RobH→03Cmjohnson ready for disk wipe and unracking [15:44:19] (03CR) 10Nuria: "argh, sorry, Luca, i tested it on the host but totally missed options had not been copied" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) (owner: 10Nuria) [15:45:55] 10Operations: Track remaining jessie systems in production - 
https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [15:50:33] akosiaris: see what happens sending code reviews before meetings :P [15:50:37] thanks, fixing it now [15:54:30] (03CR) 10Jcrespo: "The general idea seems ok (I have not yet gone line-by-line)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [16:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1600). [16:00:04] reedy: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:16] * Reedy looks around shiftily [16:00:26] elukey: :-) [16:01:04] 10Operations, 10netops, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10RobH) Just FYI it seems the serial console's have some built in nagios support. I've attached a print out of the nagios configuration screen below. {F30398437} [16:01:43] (03CR) 10Anomie: index-conf.yaml: Remove unused index (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538030 (https://phabricator.wikimedia.org/T233135) (owner: 10Marostegui) [16:02:23] 10Operations, 10Icinga, 10netops, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10RobH) [16:02:41] (03PS2) 10Elukey: Add bacula backups for Matomo and Analytics meta [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) [16:12:03] Is anyone doing puppet swat? 
:P [16:12:24] (03PS4) 10Alexandros Kosiaris: scaffold: Fix bug with concatenation of args/command [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 [16:12:26] (03CR) 10Alexandros Kosiaris: scaffold: Fix bug with concatenation of args/command (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 (owner: 10Alexandros Kosiaris) [16:12:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Indeed. Fixed, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 (owner: 10Alexandros Kosiaris) [16:15:07] !log shutting down scs-a1-codfw for replacement [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Jgreen) [16:15:50] <_joe_> Reedy: I will in a few mins, wrapping up meetings [16:15:55] cheers :) [16:16:02] <_joe_> sorry I thought I told you :/ [16:16:07] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10RobH) p:05High→03Normal [16:16:11] <_joe_> and apparently I didn't [16:17:32] heh, no worries [16:17:44] (03PS2) 10Giuseppe Lavagetto: Purge CheckUser logs every day rather than weekly [puppet] - 10https://gerrit.wikimedia.org/r/538037 (owner: 10Reedy) [16:18:10] <_joe_> Reedy: is there a task related to the change? [16:18:24] https://phabricator.wikimedia.org/T216794 [16:19:02] (03CR) 10Halfak: [C: 03+1] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:19:41] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10RobH) a:05RobH→03Cmjohnson Ready for disk wipes and continued decom process. decom system, dont return to spares. 
[16:20:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Purge CheckUser logs every day rather than weekly [puppet] - 10https://gerrit.wikimedia.org/r/538037 (owner: 10Reedy) [16:21:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:21:23] (03PS7) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:24:08] <_joe_> Reedy: done, but for some reason nothing changed on mwmaint1002 [16:24:21] Hmm [16:24:25] <_joe_> lemme see [16:24:33] Should it? Or should I have set the day thing to be some value to mean every day? [16:24:57] <_joe_> possibly, that cron resource is tricky [16:25:06] <_joe_> lemme dig into it [16:30:15] <_joe_> Reedy: removing the weekday just means that is not managed by puppet [16:30:29] aha [16:31:15] Can we use cron::daily? [16:31:19] <_joe_> !log removed manually the purge_checkuser cron from mwmaint1002, to have puppet recreate it [16:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:23] (03CR) 10Elukey: "Thanks for both reviews!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [16:31:25] (03CR) 10Cwhite: [C: 03+1] "Looks good to me. Graphite tool shows 100% metrics coverage and the naming looks consistent." 
[puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [16:31:35] <_joe_> no, don't bother, we will move to systemd timers soon enough [16:31:41] lol [16:32:04] <_joe_> # Puppet Name: purge-checkuser [16:32:06] <_joe_> 0 0 * * * /usr/local/bin/foreachwiki extensions/CheckUser/maintenance/purgeOldData.php >/dev/null 2>&1 [16:32:46] So it worked, just needed to force puppet to recreate the cron entry? [16:33:15] (03CR) 10Alexandros Kosiaris: "Hm that might cause a duplicate declaration in some cases, /me having another look" [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:34:17] <_joe_> Reedy: yes [16:34:35] cool, thanks :) [16:34:39] <_joe_> because of what I told you, we're just not managing the weekday anymore [16:34:53] <_joe_> ok, swat done I'd say :P [16:36:18] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10DannyS712) Viewing a diff, I believe [16:37:49] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10herron) [16:37:50] (03PS1) 10Dmaza: Enable SpecialMute feature on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538063 (https://phabricator.wikimedia.org/T231577) [16:40:19] 10Operations, 10ops-eqiad: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10wiki_willy) a:03Cmjohnson [16:43:25] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10herron) @greg could you please review/approve this request for deployment permissions? @Andrew-WMDE could you please coordinate obtaining a comment of approval here from your sup... 
[16:44:22] 10Operations, 10ops-eqiad: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10Cmjohnson) John checked on this first thing this morning, first thing. The power light was blinking green but are not getting any power. I had him reseat and drain flea power. That did not work. He then too... [16:55:15] (03CR) 10Hashar: "check experimental" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [16:57:32] Reedy: no deploy, correct? [16:57:38] Hm? [16:57:39] (looks already done) [16:57:43] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1600 [16:57:58] Yeah, was just the puppet patch [16:58:03] also, different swat [16:58:04] nmv [16:58:06] nvm* [16:59:21] (03CR) 10Volans: "This can be abandoned, the same feature has been already implemented in I531d8f25c80cda455705e5010c8c78115856cb67" [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 (owner: 10CRusnov) [16:59:37] (03CR) 10Hashar: "You would need the rake task to fail() / raise() whenever there are failure. Right now it exits 0 :]" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1700). 
[17:00:10] (03Abandoned) 10CRusnov: decommission: Add Netbox state change [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 (owner: 10CRusnov) [17:01:46] (03PS1) 10Cmjohnson: Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) [17:02:14] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:04:08] (03CR) 10Volans: "Question inline" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [17:05:16] (03PS1) 10Cmjohnson: old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) [17:05:57] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10jayantanth) Could you please anyone import all the pages (ns:0)/Index files/Pages? 
[17:06:37] (03CR) 10Cmjohnson: [C: 03+2] old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:07:31] (03Abandoned) 10Cmjohnson: old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:08:18] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@69b3737]: Update mobileapps to cfc3062 [17:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:48] (03PS2) 10Cmjohnson: Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) [17:09:12] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:11:37] (03Restored) 10Cmjohnson: old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:12:25] (03PS5) 10Ayounsi: Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [17:13:59] (03CR) 10Hashar: "So some random questions:" [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [17:14:00] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@69b3737]: Update mobileapps to cfc3062 (duration: 05m 42s) [17:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:20] (03PS2) 10BBlack: codfw backup LVS: BGP sessions with both routers [puppet] - 10https://gerrit.wikimedia.org/r/536324 (https://phabricator.wikimedia.org/T165765) [17:16:37] !log lvs200[456] - puppet disabled for https://gerrit.wikimedia.org/r/536324 deploy/test 
[17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:30] (03CR) 10BBlack: [C: 03+2] codfw backup LVS: BGP sessions with both routers [puppet] - 10https://gerrit.wikimedia.org/r/536324 (https://phabricator.wikimedia.org/T165765) (owner: 10BBlack) [17:19:36] (03CR) 10Ayounsi: [C: 03+2] Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 (owner: 10Ayounsi) [17:19:48] (03PS6) 10Ayounsi: Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [17:19:54] !log lvs2006 - restart pybal for deploy/test of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536324/ [17:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:11] * Krinkle staging on mwdebug1002 [17:21:16] (03Abandoned) 10Cmjohnson: Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:21:35] (03PS2) 10Cmjohnson: old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) [17:21:38] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] old mgmt entry that was never removed for wmf3151 [dns] - 10https://gerrit.wikimedia.org/r/538070 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:22:15] (03Restored) 10Cmjohnson: Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:22:22] (03PS3) 10Cmjohnson: Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) [17:23:44] !log lvs2005 - restart pybal for deploy/test of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536324/ [17:23:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:07] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns entries for new elastic servers [dns] - 10https://gerrit.wikimedia.org/r/538069 (https://phabricator.wikimedia.org/T230746) (owner: 10Cmjohnson) [17:27:00] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.23/includes/libs/objectcache/wancache: 2e910c9d3f8c04f7db, T232907 (duration: 01m 03s) [17:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:05] T232907: StatsD metrics for WANObjectCache misreported for key components containing a dot - https://phabricator.wikimedia.org/T232907 [17:27:07] !log lvs2006 - restart pybal for deploy/test of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536324/ [17:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:23] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10Krinkle) [17:28:43] bblack: nice, looks interesting! /me reads more about BGP [17:28:56] (03PS1) 10Bstorm: wiki replicas: Switch to using VariantSettings.php for now [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) [17:29:30] Krinkle: the TL;DR here is in the past, the pybal daemons on the LVS machines could each only be connected to one router. So we have redundant pybals and redundant routers, but they're not independently redundant (e.g. losing 1/2 routers also loses half our LVSes, unless we manually reconfigure things). [17:30:10] Krinkle: but pybal eventually got the feature for multiple bgp peers (quite some time ago, which I lost track of), and we're finally trying out configurations where all LVSes talk to all routers, thus making their redundancy independent of each other. [17:31:12] bblack: yeah. if it's easy to answer - what in this context does it mean for an LVS server to be connected to a router? 
That this router will send all traffic for that virtual IP to that physical server? [17:32:03] it's not about the physical L2 or even L3 reachability, it's about how the routes for the service IPs are advertised, so... [17:32:27] 10Operations, 10Performance-Team, 10SRE-Access-Requests, 10Patch-For-Review: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10greg) >>! In T232489#5483178, @jbond wrote: > @greg are you able to approve this access request Sorry for the late reply, yes... [17:32:33] e.g. in codfw, lvs2001 and lvs2003 both have the text-lb.codfw IP, with lvs2001 serving as the primary destination when both are up. [17:33:00] both lvs2001 and lvs2003 are physically connected to both the cr1 and cr2 routers through our normal datacenter vlans, so they can both in theory talk to both routers. [17:33:33] but lvs2001 would only make a bgp connection to cr1-eqiad to advertise the text-lb route, and lvs2003 would only make a bgp connection to cr2-eqiad to advertise its (backup) text-lb route [17:34:03] I see. [17:34:17] cr1-eqiad and cr2-eqiad also share these learned routes with each other, so when everything is online, both routers know about their routing options to directly contact lvs2001 or lvs2003 to handle text-lb as necessary. [17:34:47] from the outside world, what makes one router the primary and the other the backup? Is there another layer of BGP elsewhere? (seems to be paradoxical to me, but I'm only just learning about it so..) [17:34:50] but if, for instance, the cr1-eqiad router died, the cr2 would still have a physical connection to lvs2001, but would no longer see any BGP advertisement for text-lb from it (since that was coming indirectly through cr1) [17:34:57] ah OK, that's the missing bit [17:35:02] they tell each other. 
[17:35:24] this is all internal stuff, nothing to do with external-facing BGP or primacy between the hardware routers [17:35:42] (03CR) 10BryanDavis: [C: 03+1] "I think this will work, but I left an inline comment about a possible longer lasting band-aid." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) (owner: 10Bstorm) [17:36:06] so now in the above scenario, both lvs2001 and lvs2003 establish bgp connections to both of cr1 and cr2, and so losing a router doesn't lose half the adverts to the other router [17:36:15] (and similar inverse scenarios if we lose an LVS machine, etc) [17:36:16] bblack: so at this micro level, is this use of BGP basically just the mapping of a generic service IP to a specific physical host one hop away? Or is it more than that? [17:36:51] yes, basically. [17:37:23] cool [17:37:25] the official public service IPs like text-lb, live in a different "subnet" - they're not part of our ethernet vlans that real hosts live in, and there is no natural vlan or whatever for routers to send traffic destined for them towards [17:37:47] something has to tell our routers what to do with text-lb packets, and the pybal/LVS machines use BGP to let the routers know that they're destinations for it [17:38:38] !log arlolra@deploy1001 Started deploy [parsoid/deploy@77630c5]: Updating Parsoid to 6bf23c2 [17:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:51] (and then pybal/LVS routes those inbound packets out to a cluster of actual service hosts, e.g. the varnish/ats clusters. But the return trip doesn't use LVS, the outward bound side goes straight from those inner cluster machines back to the hardware routers) [17:39:33] (03PS2) 10Dmaza: Enable SpecialMute feature on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538063 (https://phabricator.wikimedia.org/T231577) [17:40:19] bblack: Right, I was wondering how that worked. 
I previously convinced myself that using LVS on a server basically turns it into a router from network POV, but sounds like that's not entirely the case given there's also an "actual" router at play. [17:40:42] (03PS4) 10Dzahn: site: allocate mw1298 as a jobrunner, add to conftool [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) [17:40:45] it is effectively a router, it's just a special-purpose one, and only used for one direction of traffic flow [17:40:59] it's directly connected to all the vlans and routes traffic between them and the hardware routers, etc [17:42:11] (03CR) 10Dzahn: [C: 03+2] site: allocate mw1298 as a jobrunner, add to conftool [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [17:42:20] (03PS5) 10Dzahn: site: allocate mw1298 as a jobrunner, add to conftool [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) [17:43:06] !log Move whisper/MediaWiki/wanobjectcache/revision_row_1/29 to whisper/MediaWiki/wanobjectcache/revision_row_1_29 on graphite1004 and graphite2003 (T232907) [17:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:09] T232907: StatsD metrics for WANObjectCache misreported for key components containing a dot - https://phabricator.wikimedia.org/T232907 [17:43:52] (03PS2) 10Bstorm: wiki replicas: Switch to using VariantSettings.php for now [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) [17:47:03] (03PS1) 10Jforrester: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 [17:47:05] (03CR) 10BryanDavis: [C: 03+1] wiki replicas: Switch to using VariantSettings.php for now [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) (owner: 10Bstorm) [17:47:05] herron hi, around? 
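[annotation] The redundancy argument from the pybal/BGP discussion above (17:29 to 17:36) can be sketched as a toy model. This is only an illustration, assuming a simplified world in which each LVS host advertises the service route to the routers it has a BGP session with, and surviving routers share whatever they learned with each other. The router names `cr1`/`cr2`, the function `visible_routes`, and the two session tables are hypothetical names for this sketch, not pybal configuration or behaviour.

```python
def visible_routes(sessions, routers_up):
    """Return, per surviving router, the LVS hosts it knows a route to.

    sessions:   dict mapping LVS host -> set of routers it peers with (BGP).
    routers_up: set of routers that are currently alive.

    Assumption for this sketch: surviving routers share learned routes with
    each other, so each ends up with the union of adverts that reached any
    surviving router. An advert sent only to a dead router is lost.
    """
    learned = set()
    for lvs_host, peers in sessions.items():
        if peers & routers_up:  # at least one live BGP session remains
            learned.add(lvs_host)
    return {router: set(learned) for router in routers_up}


# Old layout: each LVS host peers with exactly one router.
single_peer = {"lvs2001": {"cr1"}, "lvs2003": {"cr2"}}
# New layout: every LVS host peers with every router.
full_mesh = {"lvs2001": {"cr1", "cr2"}, "lvs2003": {"cr1", "cr2"}}

# With cr1 down, the old layout leaves cr2 with only the backup host.
print(visible_routes(single_peer, {"cr2"}))
# With cr1 down, the new layout still lets cr2 reach both hosts.
print(visible_routes(full_mesh, {"cr2"}))
```

In the single-peer table, losing `cr1` also removes lvs2001's advert even though the host itself is healthy, which is the coupled failure described in the chat. In the full-mesh table the two failure domains are independent.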
[17:47:30] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@77630c5]: Updating Parsoid to 6bf23c2 (duration: 08m 52s) [17:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:01] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/AbuseFilter/includes/: T156095, 32cf50453cd (duration: 01m 04s) [17:48:02] (03PS3) 10Bstorm: wiki replicas: Switch to using VariantSettings.php for now [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) [17:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:04] T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe - https://phabricator.wikimedia.org/T156095 [17:48:26] paladox: hey [17:48:45] herron would you be able to help us update https://github.com/wikimedia/puppet/commit/b8d0a9764c9465d853d9aec37d9383da31b83b28#diff-6a500e5a9001daa876354f5d078f4059R109 to support both _log and .log please? [17:48:58] We updated the error_log to be gerrit.log [17:49:08] but the rest of the logs are using _log still. [17:49:37] (03PS5) 10Jforrester: [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [17:50:25] paladox: sure, do you already have a patch? [17:50:31] nope [17:50:38] i wasen't sure how to do that. [17:50:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:51:12] cc thcipriani ^ [17:52:03] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . 
[17:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:16] ok I’ll upload something, should be simple [17:52:22] herron: indeed. I'm a little confused. We have https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/gerrit/server.pp#110 but that no longer appears to be used by logstash :) [17:52:49] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Switch to using VariantSettings.php for now [puppet] - 10https://gerrit.wikimedia.org/r/538075 (https://phabricator.wikimedia.org/T219374) (owner: 10Bstorm) [17:53:00] that rsyslog line is, however, point to the correct structured file [17:53:14] (and I see what I thought was being logged in the syslog, FWIW) [17:55:11] !log puppetmaster1001 - add mcrouter cert for mw1298.eqiad.wmnet (T192457) [17:55:13] ok, so sounds like two things. for gerrit.log, could that follow the _log convention as the rest of the log files in /var/log/gerrit do? [17:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:14] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 [17:55:24] and looking at the json log now [17:57:04] (03PS15) 10Filippo Giunchedi: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [17:57:31] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Zoranzoki21) >>! In T218155#5506935, @jayantanth wrote: > Could you please anyone import all the pages (ns:0)/Index files/Pages? @MF-Warburg... [17:57:36] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ankry) >>! 
In T218155#5506935, @jayantanth wrote: > Could you please anyone import all the pages (ns:0)/Index files/Pages? Importing Index fi... [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1800). [18:00:04] dmaza: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:13] I'm here [18:01:33] !log lvs2004 - restart pybal for deploy/test of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536324/ [18:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:33] (03PS1) 10Dzahn: add fake mcrouter certs for mw1297 and mw1298 [labs/private] - 10https://gerrit.wikimedia.org/r/538080 [18:05:55] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Bstorm) [18:06:15] (03PS6) 10Jforrester: [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [18:07:12] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:08:00] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for mw1297 and mw1298 [labs/private] - 10https://gerrit.wikimedia.org/r/538080 (owner: 10Dzahn) [18:08:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr [18:12:51] !log add TCP-MSS 1436 to cr1-eqiad 
external interfaces - T232602 [18:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:54] T232602: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 [18:14:04] (03CR) 10Filippo Giunchedi: "Thanks for taking the time!" [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [18:14:45] !log add TCP-MSS 1436 to cr2-eqiad external interfaces - T232602 [18:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:18:46] (03PS1) 10Herron: kafka_shipper: try parsing syslog messages as raw json [puppet] - 10https://gerrit.wikimedia.org/r/538081 [18:18:59] looking [18:19:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Cmjohnson) @Jclark-ctr please wipe, remove, update tracking and netbox. [18:20:29] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 46 probes of 505 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:20:36] (03PS1) 10Herron: logstash: add gerrit json log file to kafka output table [puppet] - 10https://gerrit.wikimedia.org/r/538082 [18:22:35] 10Operations, 10Thumbor, 10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10Dzahn) 05Resolved→03Declined [18:22:57] PROBLEM - mcrouter process on mw1298 is CRITICAL: NRPE: Command check_mcrouter not defined https://wikitech.wikimedia.org/wiki/Mcrouter [18:23:09] Anybody swating for this window? 
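[annotation] The log entries above set TCP-MSS 1436 on the cr1-eqiad and cr2-eqiad external interfaces as a GRE MTU mitigation (T232602) without showing where the number comes from. One plausible derivation, sketched below, assumes a 1500-byte link MTU, IPv4 and TCP headers without options, and a basic 4-byte GRE header carried in an outer IPv4 packet. These assumptions and the helper name `clamped_mss` are illustrative and are not taken from the task itself.

```python
# Header sizes in bytes; all assume no options or extensions.
IPV4_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4


def clamped_mss(link_mtu, tunnel_overhead):
    """Largest TCP payload that fits in one packet after tunnel encapsulation."""
    inner_mtu = link_mtu - tunnel_overhead  # room left for the inner IP packet
    return inner_mtu - IPV4_HEADER - TCP_HEADER


# GRE over IPv4 adds an outer IPv4 header plus the GRE header itself.
gre_overhead = IPV4_HEADER + GRE_HEADER

print(clamped_mss(1500, gre_overhead))  # prints 1436
```

Under these assumptions the inner MTU is 1476 bytes, and subtracting the inner IPv4 and TCP headers gives the 1436 seen in the log.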
[18:23:54] (03PS2) 10Herron: logstash: add gerrit json log file to kafka output table [puppet] - 10https://gerrit.wikimedia.org/r/538082 [18:24:00] (03PS7) 10Jforrester: [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [18:24:37] (03PS3) 10Herron: logstash: add gerrit json log file to kafka output table [puppet] - 10https://gerrit.wikimedia.org/r/538082 [18:25:37] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 505 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:26:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:26:55] PROBLEM - nutcracker process on mw1298 is CRITICAL: NRPE: Command check_nutcracker not defined https://wikitech.wikimedia.org/wiki/Nutcracker [18:27:23] paladox: uploaded some patches, will give it some time to gather feedback on the approach [18:27:25] PROBLEM - Check systemd state on mw1298 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:57] PROBLEM - nutcracker socket on mw1298 is CRITICAL: NRPE: Command check_nutcracker_socket not defined https://wikitech.wikimedia.org/wiki/Nutcracker [18:30:28] that one is me ^ [18:30:28] 10Operations, 10Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) * Setting tcp-mss on an interface causes all the BGP sessions going over that interface to bounce * As eqiad and codfw exchange a full view, some outbound eqiad traffic goes through codfw so we shoul... 
[18:31:01] PROBLEM - php7.2-fpm service on mw1298 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:31:46] mw1298 is still applying puppet role [18:32:23] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10MF-Warburg) >>! In T218155#5507138, @Zoranzoki21 wrote: >>>! In T218155#5506935, @jayantanth wrote: >> Could you please anyone import all the... [18:34:39] PROBLEM - Nginx local proxy to jobrunner on mw1298 is CRITICAL: connect to address 10.64.16.63 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [18:36:37] PROBLEM - Nginx local proxy to videoscaler on mw1298 is CRITICAL: connect to address 10.64.16.63 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [18:37:08] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) I thought he might be in training to join ... ;) [18:38:59] PROBLEM - PHP opcache health on mw1298 is CRITICAL: NRPE: Command check_opcache not defined https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:39:29] PROBLEM - PHP7 jobrunner on mw1298 is CRITICAL: connect to address 10.64.16.63 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [18:39:35] PROBLEM - mcrouter certs expiration check on mw1298 is CRITICAL: NRPE: Command check_mcrouter_cert_expiration not defined https://wikitech.wikimedia.org/wiki/Mcrouter [18:42:29] PROBLEM - PHP7 rendering on mw1298 is CRITICAL: connect to address 10.64.16.63 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:45:25] herron thanks! [18:46:11] (03CR) 10Paladox: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/538081 (owner: 10Herron) [18:46:33] (03CR) 10Paladox: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/538082 (owner: 10Herron) [18:46:39] RECOVERY - Nginx local proxy to jobrunner on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 340 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:46:40] RECOVERY - nutcracker process on mw1298 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:46:43] RECOVERY - PHP opcache health on mw1298 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:47:07] RECOVERY - php7.2-fpm service on mw1298 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:47:11] RECOVERY - PHP7 rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:47:13] RECOVERY - PHP7 jobrunner on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:47:15] RECOVERY - Nginx local proxy to videoscaler on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:47:19] RECOVERY - Check systemd state on mw1298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:25] RECOVERY - mcrouter process on mw1298 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [18:47:37] RECOVERY - nutcracker socket on mw1298 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_eqiad.sock https://wikitech.wikimedia.org/wiki/Nutcracker [18:49:05] (03PS12) 10Andrew Bogott: codfw1dev: 
move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:49:08] (03PS1) 10Andrew Bogott: designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T212302) [18:50:01] (03CR) 10jerkins-bot: [V: 04-1] designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [18:51:03] RECOVERY - mcrouter certs expiration check on mw1298 is OK: MCROUTERCERTVERIFICATION OK - days_left_to_client_cert_expiration is 364 https://wikitech.wikimedia.org/wiki/Mcrouter [18:53:43] (03PS2) 10Andrew Bogott: designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T233258) [18:53:45] (03PS13) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:54:24] (03CR) 10jerkins-bot: [V: 04-1] designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T233258) (owner: 10Andrew Bogott) [18:54:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet [18:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:12] (03PS3) 10Andrew Bogott: designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T233258) [18:58:14] (03PS14) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:59:31] (03PS8) 10Jforrester: [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [19:00:05] twentyafterfour: That opportune time is upon us again. 
Time for a MediaWiki train - American version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T1900). [19:00:31] 10Operations, 10serviceops, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [19:00:36] 10Operations, 10serviceops, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 05Stalled→03Resolved a:03Dzahn [19:00:38] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [19:02:33] !log There are currently no blockers for T220748 so I am preparing to deploy 1.34.0-wmf.23 to all wikis. [19:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:37] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [19:04:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:05:01] (03PS4) 10Andrew Bogott: designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T233258) [19:05:03] (03PS15) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [19:08:38] 10Operations, 10ops-codfw: refresh/replace scs-a1-codfw - https://phabricator.wikimedia.org/T231686 (10Papaul) Cr1/2-codfw both serial connections are up mr1 and msw1 serial connections up as well Working on the other serial connections [19:13:30] (03CR) 10Dzahn: "please don't use the same key in prod and on labs" [puppet] - 10https://gerrit.wikimedia.org/r/537508 (https://phabricator.wikimedia.org/T232707) 
(owner: 10Herron) [19:14:33] (03PS2) 10Jforrester: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 [19:14:35] (03PS9) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [19:15:41] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:15:49] (03PS10) 10Jforrester: Variant configuration: Pre-calculate config and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [19:16:44] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Pre-calculate config and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:18:12] (03CR) 10Jforrester: "Ah, hmm, CI images are running `version 1.6.5 2018-05-04 11:44:59` whereas my local (and most people's) is `version 1.9.0 2019-08-02 20:55" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:19:18] (03PS1) 1020after4: all wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538089 [19:19:20] (03CR) 1020after4: [C: 03+2] all wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538089 (owner: 1020after4) [19:20:23] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538089 (owner: 1020after4) [19:20:41] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/538089 (owner: 1020after4) [19:21:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:22:22] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Koavf) >>! In T218155#5051940, @jayantanth wrote: > Just one small clarification wgTranslateNumerals = by 'default' => true, so during importi... [19:23:25] (03CR) 10Andrew Bogott: [C: 03+2] designate: switch to the worker/producer model [puppet] - 10https://gerrit.wikimedia.org/r/538085 (https://phabricator.wikimedia.org/T233258) (owner: 10Andrew Bogott) [19:25:51] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.23 refs T220748 [19:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:55] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [19:28:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [19:29:39] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:30:21] (03PS2) 10Dzahn: iegreview: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537762 (https://phabricator.wikimedia.org/T224247) [19:31:42] (03PS16) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [19:31:44] (03PS1) 10Andrew Bogott: designate: explicitly set config and logfile for designate-worker and -producer [puppet] - 10https://gerrit.wikimedia.org/r/538091 [19:31:46] (03PS1) 10Andrew Bogott: designate: ensure designate-zone-manager is not running [puppet] - 10https://gerrit.wikimedia.org/r/538092 (https://phabricator.wikimedia.org/T233258) [19:33:07] (03CR) 10Andrew Bogott: [C: 03+2] designate: explicitly set config and logfile for designate-worker and -producer [puppet] - 10https://gerrit.wikimedia.org/r/538091 (owner: 10Andrew Bogott) [19:33:20] (03CR) 10Andrew Bogott: [C: 03+2] designate: ensure designate-zone-manager is not running [puppet] - 10https://gerrit.wikimedia.org/r/538092 (https://phabricator.wikimedia.org/T233258) (owner: 10Andrew Bogott) [19:37:12] (03PS17) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [19:37:14] (03PS1) 10Andrew Bogott: designate: update monitoring for producer/worker model [puppet] - 10https://gerrit.wikimedia.org/r/538094 (https://phabricator.wikimedia.org/T233258) [19:38:42] (03CR) 10Andrew Bogott: [C: 03+2] designate: update monitoring for producer/worker model [puppet] - 10https://gerrit.wikimedia.org/r/538094 (https://phabricator.wikimedia.org/T233258) (owner: 10Andrew Bogott) [19:45:52] (03PS18) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 
10https://gerrit.wikimedia.org/r/537703 [19:45:54] (03PS1) 10Andrew Bogott: designate monitoring: remove def for designate-pool-manager monitoring [puppet] - 10https://gerrit.wikimedia.org/r/538095 (https://phabricator.wikimedia.org/T233258) [19:46:07] (03PS2) 10Jhedden: maintain-replicas: Add ipb_sitewide field to ipblocks and ipblocks_ipindex [puppet] - 10https://gerrit.wikimedia.org/r/504653 (https://phabricator.wikimedia.org/T221272) (owner: 10Alex Monk) [19:47:09] (03CR) 10Andrew Bogott: [C: 03+2] designate monitoring: remove def for designate-pool-manager monitoring [puppet] - 10https://gerrit.wikimedia.org/r/538095 (https://phabricator.wikimedia.org/T233258) (owner: 10Andrew Bogott) [19:48:34] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18433/miscweb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537762 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [19:48:43] (03PS3) 10Dzahn: iegreview: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537762 (https://phabricator.wikimedia.org/T224247) [19:49:38] (03CR) 10Jhedden: [C: 03+2] maintain-replicas: Add ipb_sitewide field to ipblocks and ipblocks_ipindex [puppet] - 10https://gerrit.wikimedia.org/r/504653 (https://phabricator.wikimedia.org/T221272) (owner: 10Alex Monk) [19:49:49] (03PS3) 10Jhedden: maintain-replicas: Add ipb_sitewide field to ipblocks and ipblocks_ipindex [puppet] - 10https://gerrit.wikimedia.org/r/504653 (https://phabricator.wikimedia.org/T221272) (owner: 10Alex Monk) [19:57:30] (03PS1) 10Mobrovac: [WIP][Beta] Parsoid: Add PHP7 config [puppet] - 10https://gerrit.wikimedia.org/r/538099 [20:02:33] (03PS2) 10Mobrovac: [Beta] Parsoid: Add PHP7 config [puppet] - 10https://gerrit.wikimedia.org/r/538099 (https://phabricator.wikimedia.org/T231569) [20:02:50] (03PS4) 10Dzahn: maintain-replicas: Add ipb_sitewide field to ipblocks and ipblocks_ipindex [puppet] - 10https://gerrit.wikimedia.org/r/504653 
(https://phabricator.wikimedia.org/T221272) (owner: 10Alex Monk) [20:03:24] (03CR) 10Mobrovac: [C: 03+1] "Applied to Beta, works." [puppet] - 10https://gerrit.wikimedia.org/r/538099 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [20:06:10] (03PS2) 10Dzahn: wikimania_scholarships: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537763 (https://phabricator.wikimedia.org/T224247) [20:07:42] !log push firewall policies to pfw3-codfw - T233325 [20:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:25] (03CR) 10Dzahn: [C: 03+1] "this should remove hhvm" [puppet] - 10https://gerrit.wikimedia.org/r/538099 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [20:10:00] mutante: since ^ is beta-only, mind merging? (and yes, it does remove hhvm completely) [20:10:14] well, at least it removes it as a proxy target from apache for sure :) [20:10:52] (03CR) 10Dzahn: [C: 03+2] [Beta] Parsoid: Add PHP7 config [puppet] - 10https://gerrit.wikimedia.org/r/538099 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [20:11:01] (03PS3) 10Dzahn: [Beta] Parsoid: Add PHP7 config [puppet] - 10https://gerrit.wikimedia.org/r/538099 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [20:11:19] would be slightly nicer if we could use role-based hiera instead of host name but i see that is how it's done for all others too [20:11:41] yes, merging [20:11:43] yeah but i don't see an obvious way to do it in this case, but i agree [20:11:48] thnx mutante! [20:14:48] yw, it's on the master now. 
so if $install_hhvm is set to false that just means it does not include the profile class for it [20:15:22] !log push firewall policies to pfw3-eqiad - T233325 [20:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:48] it may not actively remove all remnants [20:17:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10ayounsi) [20:17:55] which could mean the cleanest way is to reinstall the OS on that VPS. but creating a new one is easier, just then your host name changes again and you need another puppet change [20:18:17] mutante: looking at the puppet output and pkg list it seems like the actual packages are not removed, but now there is no way for apache to talk to it, so that's fine [20:19:32] and /usr/bin/php is symlinked to php7.2 so all good [20:19:34] mobrovac: ok, or let's apt-get remove --purge hhvm [20:19:38] alright [20:19:50] doing just that now :) [20:19:59] cool [20:20:38] (03CR) 10Krinkle: "Hm... was this intentional? We've very intentionally excluded dev dependencies before, as https://gerrit.wikimedia.org/r/535703 did just i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [20:21:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Cmjohnson) @Jclark-ctr wipe, remove the servers, update netbox and the google sheet. 
Please assign back to me once everything is complete [20:21:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Cmjohnson) a:03Jclark-ctr [20:22:36] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Cmjohnson) a:03Jclark-ctr @Jclark-ctr wipe, remove the servers, update netbox and the google sheet. Please assign back to me once eve... [20:23:16] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18434/miscweb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537763 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [20:23:24] (03PS3) 10Dzahn: wikimania_scholarships: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537763 (https://phabricator.wikimedia.org/T224247) [20:23:48] (03CR) 10BryanDavis: [C: 03+1] tools-manifest: increase the timeout to 30s [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/536378 (https://phabricator.wikimedia.org/T220650) (owner: 10Bstorm) [20:23:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr wipe, remove the servers, update netbox and the google sheet. 
Please assign back to me once everything is complete [20:24:58] (03PS1) 10Urbanecm: Set wgGEConfirmEmailEnabled to false for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538100 [20:25:04] RoanKattouw: ^^^ [20:27:58] (03CR) 10Urbanecm: [C: 03+2] Set wgGEConfirmEmailEnabled to false for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538100 (owner: 10Urbanecm) [20:29:05] (03PS3) 10Dzahn: racktables: set db host in Hiera, set to eqiad, use lookup [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) [20:29:19] (03PS1) 10Krinkle: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 [20:30:48] (03PS2) 10Krinkle: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 [20:35:51] (03PS1) 10Cmjohnson: Removing mgmt dns decom host ms-be101[3-5] [dns] - 10https://gerrit.wikimedia.org/r/538103 (https://phabricator.wikimedia.org/T220590) [20:35:53] (03PS4) 10Dzahn: racktables: set db host in Hiera, set to eqiad, use lookup [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) [20:36:18] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns decom host ms-be101[3-5] [dns] - 10https://gerrit.wikimedia.org/r/538103 (https://phabricator.wikimedia.org/T220590) (owner: 10Cmjohnson) [20:37:21] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18436/miscweb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [20:37:58] (03PS5) 10Dzahn: racktables: set db host in Hiera, set to eqiad, use lookup [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) [20:41:23] (03PS2) 10Catrope: Set wgGEConfirmEmailEnabled to false for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538100 (https://phabricator.wikimedia.org/T233363) (owner: 
10Urbanecm) [20:42:00] (03CR) 10Catrope: [C: 04-1] "Let's wait a little bit for Marshall to decide whether he wants us to disable this or not." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538100 (https://phabricator.wikimedia.org/T233363) (owner: 10Urbanecm) [20:42:35] Urbanecm: Thanks! Marshall just told me about it, and he's unsure whether he wants to disable it or keep it on [20:43:20] Okay RoanKattouw. Either way, there IMO should be something in config (either true or false). [20:43:34] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, and 2 others: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Cmjohnson) [20:43:49] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, and 2 others: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Cmjohnson) 05Open→03Resolved [20:44:58] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Cmjohnson) a:05RobH→03Jclark-ctr John, please wipe the servers, remove from the rack,... [20:45:00] Urbanecm: Yes, agreed [20:45:51] (03PS3) 10Krinkle: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 [20:46:11] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Cmjohnson) a:03Jclark-ctr John, please wipe the servers, remove from the rack, update netbox and the tracking sheet. Assign back to me once you finish so I can kill the switch ports. 
[20:52:26] (03PS1) 10Cmjohnson: Removing mgmt ip for decom labservices100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/538106 (https://phabricator.wikimedia.org/T221857) [20:53:30] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt ip for decom labservices100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/538106 (https://phabricator.wikimedia.org/T221857) (owner: 10Cmjohnson) [20:54:18] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10Cmjohnson) [20:54:26] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10Cmjohnson) 05Open→03Resolved [20:54:58] (03PS1) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 [20:55:32] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Cmjohnson) 05Open→03Resolved these were added to the tracking sheet [20:56:01] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001 & labcontrol1002 - https://phabricator.wikimedia.org/T221817 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, please wipe the servers, remove from the rack, update netbox and the tracking sheet. Assign back to me once you finish so I ca... [20:56:47] (03PS2) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 [20:56:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, please wipe the servers, remove from the rack, update netbox and the tracking sheet. Assign back to me once... 
[20:57:18] (03PS3) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [20:57:23] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, please wipe the servers, remove from the rack, update n... [20:57:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, please wipe the servers, remove from the rack, update netbox and the tracking sheet. Assign back to me once you finish so I can kill the... [20:57:57] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Performance-Team-publish: The 5min expires for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (10Krinkle) [20:58:02] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Performance-Team-publish: The 5min expiry for load.php/startup should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (10Krinkle) [20:58:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, please wipe the servers, remove from the rack, update netbox and the tracking sheet. Assign back to me once you finish so I can kill the s... 
[20:59:39] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [21:02:38] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Emergency syslog message [21:03:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson) 05Open→03Resolved It's been nearly 24 hours and there are 0 errors. resolving the task cmjohnson@asw2-c-eqiad> show interfaces xe-2/0/45 extensive |... [21:07:39] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [21:16:21] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:17:46] ^ yea, that seems to be just a single host, bast in eqsin [21:18:52] but one host there is over 5% so it's considered widespread [21:19:16] or 20% per cluster [21:23:00] Applied catalog in 192.56 seconds [21:25:43] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:28:55] jouncebot: now [21:28:55] No deployments scheduled for the next 1 hour(s) and 31 minute(s) [21:29:13] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) >>! In T222099#5507968, @daniel wrote: >>>! In T222099#5507901, @Reedy wrote: >> So, yeah, it's Kask (even if indirectly), and the way MW uses it is breaking this behaviour. It wor... [21:29:35] I'm going to deploy a patch real quick. 
[21:30:58] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538063 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [21:31:49] (03Merged) 10jenkins-bot: Enable SpecialMute feature on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538063 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [21:32:06] (03CR) 10jenkins-bot: Enable SpecialMute feature on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538063 (https://phabricator.wikimedia.org/T231577) (owner: 10Dmaza) [21:32:24] (03PS4) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [21:34:27] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [21:37:15] jouncebot: now [21:37:15] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [21:37:41] Oh hello hauskatze. Long time, no see. [21:37:48] !log niharika29@deploy1001 Synchronized wmf-config/VariantSettings.php: Enable special:mute on testwiki; T231577 (duration: 00m 56s) [21:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:52] T231577: Deploy Special:Mute features - https://phabricator.wikimedia.org/T231577 [21:37:59] (03PS5) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [21:38:09] Hi Niharika - long time indeed. Happy to see you. [21:38:58] hauskatze: :) I forgot, are you a steward? 
[21:39:52] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [21:40:32] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Date TBD) - https://phabricator.wikimedia.org/T226782 (10ayounsi) This is alerting: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a1-eqiad [21:40:35] Niharika: yes [21:42:13] hauskatze: Well then you would be happy to learn that we are thinking about making some improvements to CheckUser. :) [21:45:44] Niharika: Oh, awesome [21:46:20] (03PS6) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [21:56:15] (03PS7) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [21:58:33] (03CR) 10Cwhite: [C: 03+2] hiera: disable statsd_exporter::relay_address on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:58:41] (03PS18) 10Cwhite: hiera: disable statsd_exporter::relay_address on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) [22:04:51] (03CR) 10Dzahn: [C: 04-1] "it's doing the opposite of the intention? 
https://puppet-compiler.wmflabs.org/compiler1001/18440/" [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:12:46] (03CR) 10Cwhite: [C: 03+1] ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [22:14:07] (03CR) 10Cwhite: [C: 03+2] Add git buildpackage configuration [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519363 (owner: 10Hashar) [22:14:57] (03CR) 10Dzahn: "change in the actual config:" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [22:17:00] (03PS2) 10Dzahn: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/536357 (https://phabricator.wikimedia.org/T222391) [22:17:28] (03CR) 10Cwhite: [C: 03+1] "I'm good to roll with it and tweak thresholds later if we need greater sensitivity. LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [22:18:53] (03CR) 10Dzahn: [C: 03+2] gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/536357 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [22:21:39] (03PS11) 10Cwhite: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [22:25:17] (03CR) 10BryanDavis: "Quite a few comments inline. All of them are questions about changing naming conventions or fixing non-material typos. I won't be sad if y" (035 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [22:25:56] (03CR) 10Cwhite: [C: 03+1] "Looks good to me. Might be worth applying to deployment-prep to be extra sure, but it appears reasonable." 
[puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [22:28:14] (03CR) 10Thcipriani: [C: 03+1] "Awesome, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/538082 (owner: 10Herron) [22:34:01] (03PS1) 10Dzahn: Revert "gerrit: add role on gerrit1001 and remove spare" [puppet] - 10https://gerrit.wikimedia.org/r/538122 [22:34:18] !log gerrit1001 - stopping puppet, removing gerrit IP from interface, rebooting [22:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:56] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: add role on gerrit1001 and remove spare" [puppet] - 10https://gerrit.wikimedia.org/r/538122 (owner: 10Dzahn) [22:38:58] 10Operations, 10Release-Engineering-Team-TODO, 10Traffic: Blubberoid endpoint intermittently routing to MediaWiki backend - https://phabricator.wikimedia.org/T233369 (10dduvall) [22:41:16] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:41:57] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:48:05] PROBLEM - Host gerrit1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:48:14] ! [22:48:39] ah [22:49:04] chaomodus: don't worry, it's not the prod gerrit and it's me [22:49:15] applying the role to a new server had some issue [22:49:44] RECOVERY - Host gerrit1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:49:50] i looked up the list slightly and saw that :) [22:51:36] it was a case of "all services on this host but not the host itself are in downtime" [22:51:45] sorry for even alerting at all [22:56:06] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10MusikAnimal) There is a thread on checkuser-l about this. 
At least three users intermittently got 503s today / this week when using Special:CheckUser. I don't know if it's related, but I'm assuming what @Dan... [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190919T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:03:58] (03PS1) 10Dzahn: install_server: gerrit1001 back to stretch [puppet] - 10https://gerrit.wikimedia.org/r/538126 [23:04:33] (03CR) 10Dzahn: [C: 03+2] install_server: gerrit1001 back to stretch [puppet] - 10https://gerrit.wikimedia.org/r/538126 (owner: 10Dzahn) [23:08:59] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` gerrit1001.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/201909192308_dzahn_39474_gerrit10... [23:21:11] (03PS1) 10Dzahn: add gerrit-new.wikimedia.org for migration [dns] - 10https://gerrit.wikimedia.org/r/538127 (https://phabricator.wikimedia.org/T222391) [23:21:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:34] (03PS1) 10Dzahn: gerrit: set gerrit-new as name/IP for new gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/538128 (https://phabricator.wikimedia.org/T222391) [23:26:49] 10Operations, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['gerrit1001.wikimedia.org'] ` and were **ALL** successful. 
[23:31:24] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) [23:33:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) a:05Papaul→03Dwisehaupt This is done at my end. [23:38:55] (03PS1) 10Jforrester: [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 [23:39:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (owner: 10Jforrester) [23:52:47] 10Operations, 10MobileFrontend, 10Traffic: Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (10AntiCompositeNumber)