[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T0000).
[00:12:35] PROBLEM - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 5.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops
[00:21:31] (PS1) Dzahn: phabricator: install packages and apt::repo independent of using php-fpm [puppet] - https://gerrit.wikimedia.org/r/541993 (https://phabricator.wikimedia.org/T190568)
[00:25:33] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1002/18815/phab1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/541993 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[00:28:32] (CR) Dzahn: [C: -1] "doesn't work. causes duplicate declaration from the php class that is included when removing this if" [puppet] - https://gerrit.wikimedia.org/r/541993 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[00:38:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[00:38:58] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[00:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[00:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:51] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001
[00:39:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[00:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:04] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn) thanks! mgmt password updated using cookbook.
[01:03:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:22:45] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:43:59] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:05:48] (CR) BryanDavis: tools-webservice: Disable access.log feature by default (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) (owner: Phamhi)
[02:07:49] (CR) BryanDavis: [C: +1] CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: Arturo Borrero Gonzalez)
[02:59:43] (CR) Subramanya Sastry: "Giueseppe / Daniel: Can we get this swatted tomorrow?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[03:39:07] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (MaxSem)
[04:33:18] (CR) Krinkle: "Where can I learn more about this issue?" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[04:34:46] (CR) Marostegui: [C: +1] "> Where can I learn more about this issue?" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[04:41:58] (PS1) Marostegui: dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/542000 (https://phabricator.wikimedia.org/T235016)
[04:43:13] (CR) Marostegui: [C: +2] dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/542000 (https://phabricator.wikimedia.org/T235016) (owner: Marostegui)
[04:43:58] !log Depool labsdb1011 for recloning - T235016
[04:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:02] T235016: Reclone labsdb1011 - https://phabricator.wikimedia.org/T235016
[04:49:27] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[04:49:32] ^ expected
[04:50:35] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[04:50:59] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T235016 https://wikitech.wikimedia.org/wiki/HAProxy
[04:50:59] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T235016 https://wikitech.wikimedia.org/wiki/HAProxy
[04:53:54] !log Deploy schema change on db1061 (s6 eqiad master) - T233135 T234066
[04:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:59] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[04:53:59] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:03:42] (PS1) Marostegui: site.pp: Remove db2057 puppet references [puppet] - https://gerrit.wikimedia.org/r/542002 (https://phabricator.wikimedia.org/T230394)
[05:04:20] (PS1) Marostegui: wmnet: Remove db2057 production DNS entries [dns] - https://gerrit.wikimedia.org/r/542003 (https://phabricator.wikimedia.org/T230394)
[05:04:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:53] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2057.codfw.wmnet` - db2057.codfw.wmnet (**PASS**)...
[05:04:56] (CR) Marostegui: [C: +2] wmnet: Remove db2057 production DNS entries [dns] - https://gerrit.wikimedia.org/r/542003 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[05:05:13] (CR) Marostegui: [C: +2] site.pp: Remove db2057 puppet references [puppet] - https://gerrit.wikimedia.org/r/542002 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[05:06:02] Operations, ops-codfw, DC-Ops, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui) a:RobH→Papaul
[05:06:15] Operations, ops-codfw, DC-Ops, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui) Host ready for on-site + switch disablement steps
[05:11:00] (PS1) Marostegui: install_server: Allow install db213[2-5] [puppet] - https://gerrit.wikimedia.org/r/542005 (https://phabricator.wikimedia.org/T234608)
[05:12:01] (CR) Marostegui: [C: +2] install_server: Allow install db213[2-5] [puppet] - https://gerrit.wikimedia.org/r/542005 (https://phabricator.wikimedia.org/T234608) (owner: Marostegui)
[05:27:44] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:38:14] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:47:40] !log Depool db1084 db1083 db1076 db1118 for PDU maintenance - T227536
[05:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:45] T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536
[05:48:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 db1083 db1076 db1118 for
PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9292 and previous config saved to /var/cache/conftool/dbconfig/20191010-054853-marostegui.json
[05:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1078 into rc service for s3 for PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9293 and previous config saved to /var/cache/conftool/dbconfig/20191010-055102-marostegui.json
[05:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for PDU maintenance T227536', diff saved to https://phabricator.wikimedia.org/P9294 and previous config saved to /var/cache/conftool/dbconfig/20191010-055153-marostegui.json
[05:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:50] (PS14) Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297)
[06:01:52] (PS20) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[06:01:54] (PS17) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[06:01:56] (PS12) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[06:01:58] (PS12) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[06:02:00] (PS12) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998
(https://phabricator.wikimedia.org/T232297)
[06:06:01] (CR) jerkins-bot: [V: -1] query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[06:06:33] (CR) jerkins-bot: [V: -1] query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[06:19:08] (PS21) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[06:19:10] (PS18) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[06:19:12] (PS13) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[06:19:14] (PS13) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[06:19:16] (PS13) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[06:20:45] (CR) Jeena Huneidi: [C: +2] mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: Brennen Bearnes)
[06:21:58] (Merged) jenkins-bot: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) (owner: Brennen Bearnes)
[06:28:05] (PS22) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138
(https://phabricator.wikimedia.org/T232297)
[06:28:07] (PS19) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[06:28:09] (PS14) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[06:28:11] (PS14) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[06:28:13] (PS14) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[06:30:16] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Given the fact that the GPU on stat1005 seems to work and we h...
[06:30:17] (CR) jerkins-bot: [V: -1] query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[06:30:20] (CR) Mathew.onipe: "PCC look Ok: https://puppet-compiler.wmflabs.org/compiler1001/18818/" [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[06:30:31] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) @Nuria thoughts?
[06:33:31] !log Revoke privileges from designate user on the designate_pool_manager database - T233978
[06:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:35] T233978: Drop 'designate_pool_manager' database from m5 and remove associated grants - https://phabricator.wikimedia.org/T233978
[06:33:47] (PS23) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[06:33:49] (PS20) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[06:33:51] (PS15) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[06:33:53] (PS15) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[06:33:55] (PS15) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[06:41:02] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:42:53] (PS1) Jeena Huneidi: Update mediawiki-dev chart README [deployment-charts] - https://gerrit.wikimedia.org/r/542011 (https://phabricator.wikimedia.org/T222494)
[06:45:40] !log Drop designate_pool_manager database from m5 - T233978
[06:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:44] T233978: Drop 'designate_pool_manager' database from m5 and remove associated grants -
https://phabricator.wikimedia.org/T233978
[06:48:48] RECOVERY - ElasticSearch shard size check - 9200 on logstash2002 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:48:48] RECOVERY - ElasticSearch shard size check - 9200 on logstash2001 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:48:52] RECOVERY - ElasticSearch shard size check - 9200 on logstash2003 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:48:52] RECOVERY - ElasticSearch shard size check - 9200 on logstash2006 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:48:52] RECOVERY - ElasticSearch shard size check - 9200 on logstash2005 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:48:52] RECOVERY - ElasticSearch shard size check - 9200 on logstash2004 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[06:53:11] Operations, Analytics, Analytics-Kanban, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) Interesting: ` Oct 10 03:00:01 krb2001 kpropd[26599]: Connection from krb1001.eqiad.wmnet Oct 10 03:26:24 krb2001 systemd[1]: Stopping Kerb...
[06:56:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:09] Operations, ops-codfw, DC-Ops, decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2063.codfw.wmnet` - db2063.codfw.wmnet (**PASS**) - Downtimed host on Ic...
[06:57:42] (PS1) Marostegui: site.pp: Remove db2063 from puppet [puppet] - https://gerrit.wikimedia.org/r/542013 (https://phabricator.wikimedia.org/T230704)
[06:58:02] (PS1) Elukey: kerberos: add nagios process monitors to kdc/kadmind daemons [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089)
[06:58:28] (PS1) Marostegui: wmnet: Remove db2063 DNS production entries [dns] - https://gerrit.wikimedia.org/r/542015 (https://phabricator.wikimedia.org/T230704)
[06:58:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:59:29] (CR) Marostegui: [C: +2] site.pp: Remove db2063 from puppet [puppet] - https://gerrit.wikimedia.org/r/542013 (https://phabricator.wikimedia.org/T230704) (owner: Marostegui)
[06:59:49] (CR) Marostegui: [C: +2] wmnet: Remove db2063 DNS production entries [dns] - https://gerrit.wikimedia.org/r/542015 (https://phabricator.wikimedia.org/T230704) (owner: Marostegui)
[07:00:59] (PS2) Elukey: kerberos: add nagios process monitors to kdc/kadmind daemons [puppet] - https://gerrit.wikimedia.org/r/542014
(https://phabricator.wikimedia.org/T226089)
[07:01:00] Operations, ops-codfw, DC-Ops, decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) a:RobH→Papaul
[07:01:02] Operations, ops-codfw, DC-Ops, decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) Host ready for on-site and switch disablement steps
[07:05:37] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) ` elukey@krb2001:~$ sudo systemctl cat krb5-kpropd.service # /lib/systemd/system/krb5-kpropd.service [Unit] Descriptio...
[07:07:32] (CR) Muehlenhoff: kerberos: add nagios process monitors to kdc/kadmind daemons (2 comments) [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[07:07:43] (CR) Muehlenhoff: "Looks good, two comments inline" [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[07:12:04] (CR) Elukey: kerberos: add nagios process monitors to kdc/kadmind daemons (2 comments) [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[07:18:27] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff The package was alread...
[07:18:34] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (MoritzMuehlenhoff)
[07:19:28] (CR) Muehlenhoff: [C: +1] "Looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[07:20:21] (PS3) Elukey: kerberos: add nagios process monitors to kdc/kadmind daemons [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089)
[07:21:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:08] Telia planned work --^
[07:25:15] (CR) Elukey: [C: +2] kerberos: add nagios process monitors to kdc/kadmind daemons [puppet] - https://gerrit.wikimedia.org/r/542014 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[07:45:53] (CR) Muehlenhoff: "There's no need for a distro check; heirloom-mailx on stretch is a transitional package which installs s-nail (see "apt-cache show heirloo" [puppet] - https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[07:49:21] (CR) Muehlenhoff: [C: +1] remove references to no-longer-in-use labpuppetmaster1001/1002 [puppet] - https://gerrit.wikimedia.org/r/541829 (https://phabricator.wikimedia.org/T234462) (owner: Andrew Bogott)
[07:55:12] !log Stop MySQL on es1014 es1013 db1084 db1083 db1077 db1076 db1112 db1124 db1118 for on-site PDU maintenance (this will generate lag on labsdb hosts) - T227536
[07:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:16] T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536
[07:58:14] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) Nope, it seems that puppet is causing the stop/start of kpropd and rsync: ` Notice: /Stage[main]/Profile::Kerberos::K...
[08:02:20] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) Two daemons make sense: ` elukey@krb2001:~$ sudo systemctl status rsync ● rsync.service - fast remote file copy progr...
[08:03:34] (PS3) Mobrovac: Add wtp1025/wtp2001 to the list of servers using Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[08:04:06] * mobrovac taking over deploy1001 for a quick wmf-config deploy
[08:07:35] Operations, Phabricator: List of recent most active Phab "Priority" field setters - https://phabricator.wikimedia.org/T235153 (Aklapper) p:Triage→Low
[08:07:41] (CR) Mobrovac: [C: +2] Add wtp1025/wtp2001 to the list of servers using Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[08:08:29] (Merged) jenkins-bot: Add wtp1025/wtp2001 to the list of servers using Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[08:08:30] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (Marostegui) All the databases, es1013, es1014 and dbproxy1014 are good to go.
dbstore1004 needs to be handled by @elukey, so please coordinate with him before working on that one.
[08:09:12] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) >>! In T226089#5562297, @elukey wrote: > > The kadmin server seems stopping by itself, but kadmin.local works on 2001...
[08:09:18] (CR) Aaron Schulz: "I want to have the MW default be to disabled (and not have a WMF override enabling it again), so I wanted to see make that would be OK via" [mediawiki-config] - https://gerrit.wikimedia.org/r/521967 (owner: Aaron Schulz)
[08:11:33] (PS1) Mobrovac: RESTRouter: Use image v1.1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/542061 (https://phabricator.wikimedia.org/T170455)
[08:12:15] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Add wtp1025/wtp2001 to the list of servers using Parsoid/PHP - T233654 (duration: 01m 01s)
[08:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:19] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654
[08:12:26] * mobrovac is done with deploy1001
[08:14:08] (CR) Mobrovac: [C: +2] RESTRouter: Use image v1.1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/542061 (https://phabricator.wikimedia.org/T170455) (owner: Mobrovac)
[08:14:20] (Merged) jenkins-bot: RESTRouter: Use image v1.1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/542061 (https://phabricator.wikimedia.org/T170455) (owner: Mobrovac)
[08:17:38] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[08:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:44] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' .
[08:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:56] (PS1) Elukey: kerberos: ensure kadmind and rsync only on the master node [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089)
[08:23:18] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' .
[08:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:20] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[08:24:40] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[08:26:25] (PS2) Elukey: kerberos: ensure kadmind and rsync only on the master node [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089)
[08:29:49] (PS3) Elukey: kerberos: ensure kadmind and rsync only on the master node [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089)
[08:34:10] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18822/" [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[08:36:08] (PS4) Elukey: kerberos: ensure kadmind and rsync only on the master node [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089)
[08:37:14] (PS3) Gehel: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684 (owner: Mathew.onipe)
[08:38:52] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18823/" [puppet] - https://gerrit.wikimedia.org/r/542062 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[08:39:59] (CR) Elukey: [C: +2] kerberos: ensure kadmind and rsync only on the master node [puppet] - https://gerrit.wikimedia.org/r/542062
(https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[08:46:04] (PS1) Mobrovac: helmfile_log_sal: Fix getting the user and host for logging [puppet] - https://gerrit.wikimedia.org/r/542064
[08:46:59] (CR) Volans: "Is this still needed or got superseeded?" [software/spicerack] - https://gerrit.wikimedia.org/r/529802 (owner: Mathew.onipe)
[08:58:21] (PS4) Gehel: wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684 (owner: Mathew.onipe)
[08:59:26] (CR) Gehel: [C: +2] wdqs: cleanup un-useful nginx config [puppet] - https://gerrit.wikimedia.org/r/541684 (owner: Mathew.onipe)
[08:59:40] onimisionipe: ^
[09:02:23] alright. thanks!
[09:02:34] will confirm
[09:03:26] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.28 [software/spicerack] - https://gerrit.wikimedia.org/r/542066
[09:04:30] (PS2) Arturo Borrero Gonzalez: CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836)
[09:04:33] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) Seems fixed now. The culprit I believe it was: ` elukey@krb2001:~$ sudo systemctl cat krb5-kpropd.service # /lib/syst...
[09:07:50] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.28 [software/spicerack] - https://gerrit.wikimedia.org/r/542066 (owner: Volans)
[09:08:57] (PS1) Elukey: kerberos: add nagios process monitoring for kpropd [puppet] - https://gerrit.wikimedia.org/r/542067 (https://phabricator.wikimedia.org/T226089)
[09:09:51] (PS2) Elukey: kerberos: add nagios process monitoring for kpropd [puppet] - https://gerrit.wikimedia.org/r/542067 (https://phabricator.wikimedia.org/T226089)
[09:10:46] (CR) Elukey: [C: +2] kerberos: add nagios process monitoring for kpropd [puppet] - https://gerrit.wikimedia.org/r/542067 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[09:21:28] (CR) Arturo Borrero Gonzalez: [C: +2] CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: Arturo Borrero Gonzalez)
[09:23:04] (CR) Joal: "Thanks Erik :)" [puppet] - https://gerrit.wikimedia.org/r/541654 (owner: EBernhardson)
[09:36:58] (PS1) Jbond: puppet.wikimedia.org: move to codfw [dns] - https://gerrit.wikimedia.org/r/542070 (https://phabricator.wikimedia.org/T234315)
[09:37:26] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/542070 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[09:39:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:40:12] * effie checking
[09:40:27] (CR) Jbond: [C: +2] puppet.wikimedia.org: move to codfw [dns] - https://gerrit.wikimedia.org/r/542070 (https://phabricator.wikimedia.org/T234315) (owner: Jbond)
[09:40:31] (PS2) Jbond:
puppet.wikimedia.org: move to codfw [dns] - 10https://gerrit.wikimedia.org/r/542070 (https://phabricator.wikimedia.org/T234315) [09:42:02] (03PS4) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [09:46:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:56:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: introduce FQDN for cloudinstances2b-gw in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/542072 (https://phabricator.wikimedia.org/T234836) [09:58:07] (03CR) 10Volans: [V: 03+2 C: 03+2] "Sphinx still failing for https://github.com/psf/requests/issues/5212, merging anyway." [software/spicerack] - 10https://gerrit.wikimedia.org/r/542066 (owner: 10Volans) [09:58:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: introduce FQDN for cloudinstances2b-gw in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/542072 (https://phabricator.wikimedia.org/T234836) (owner: 10Arturo Borrero Gonzalez) [10:00:29] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.28 [software/spicerack] - 10https://gerrit.wikimedia.org/r/542066 (owner: 10Volans) [10:00:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:03:05] 10Puppet: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (10MoritzMuehlenhoff) I looked into this and it's quite a mess! 
The debmonitor GID (which was the one clashing) is created in the postinst of debmonitor-client. Users/groups are added with s the default tool in Debian for t... [10:05:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:08:06] 10Operations, 10LDAP: Improve management of users/groups on servers in production - https://phabricator.wikimedia.org/T235161 (10Peachey88) [10:08:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:11:54] (03CR) 10Alexandros Kosiaris: "> I want to have the MW default be to disabled (and not have a WMF override enabling it again), so I wanted to see make that would be OK v" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [10:12:26] 10Operations: Improve management of users/groups on servers in production - https://phabricator.wikimedia.org/T235161 (10MoritzMuehlenhoff) [10:13:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:14:15] 10Operations: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10ayounsi) [10:14:17] 10Operations: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 
(10MoritzMuehlenhoff) [10:14:56] 10Operations: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10Peachey88) [10:15:10] (03PS1) 10Volans: Upstream release v0.0.28 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/542075 [10:16:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:19:37] (03CR) 10jerkins-bot: [V: 04-1] Upstream release v0.0.28 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/542075 (owner: 10Volans) [10:21:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:23:18] jouncebot next [10:23:18] In 0 hour(s) and 36 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1100) [10:30:10] (03CR) 10Volans: [V: 03+2 C: 03+2] "Sphinx still failing for https://github.com/psf/requests/issues/5212, merging anyway." 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/542075 (owner: 10Volans) [10:33:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:34:18] !log uploaded spicerack_0.0.28-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [10:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:38:23] (03PS1) 10Jbond: puppet.eqiad.wmnet: move to codfw [dns] - 10https://gerrit.wikimedia.org/r/542078 (https://phabricator.wikimedia.org/T234315) [10:38:58] !log rebalancing Ganeti eqiad/row C after rolling reboots of Ganeti nodes [10:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:44:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:45:52] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 
10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10aborrero) [10:49:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:57:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:08] o/ [11:00:21] I’m about to upload two config changes [11:00:26] might SWAT those if there’s nothing else to do [11:00:44] (*should* be beta-only but will be tested on mwdebug1002 to make sure they don’t break anything in prod) [11:03:54] T227536: b1-eqiad pdu refresh [11:03:55] T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 [11:04:16] starting T227536: b1-eqiad pdu refresh [11:04:48] (03PS1) 10Lucas Werkmeister (WMDE): Rename data bridge config variable names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542080 (https://phabricator.wikimedia.org/T235033) [11:04:52] (03PS1) 10Lucas Werkmeister (WMDE): Set dataBridgeEnabled repo setting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542081 (https://phabricator.wikimedia.org/T235033) [11:06:06] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10Jclark-ctr) Starting b1-eqiad pdu refresh [11:06:43] ok, swatting my changes [11:06:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542080 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:07:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:07:36] (03Merged) 10jenkins-bot: Rename data bridge config variable names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542080 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:08:09] testing first change on mwdebug1002 [11:10:19] seems to be fine, syncing [11:13:10] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:542080|Rename data 
bridge config variable names (T235033)]] (affects IS-labs and CS, but the CS part is all guarded by isset(), so should be safe to sync both at once, I think) (duration: 01m 00s) [11:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:14] T235033: Set dataBridgeEnabled repo setting on Beta - https://phabricator.wikimedia.org/T235033 [11:13:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542081 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:14:08] (03Merged) 10jenkins-bot: Set dataBridgeEnabled repo setting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542081 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:14:51] !log ^ (and by CS, I actually mean Wikibase.php, not CommonSettings.php, sorry) [11:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:15:21] hm, looking… [11:15:50] ok, that’s an unusually high error rate since ca. 0:15 (UTC?) 
[11:16:04] concerning, but hopefully not my fault [11:16:19] testing the second change on mwdebug1002 [11:17:47] checking logstash [11:18:00] (unusally high error rate since *9*:15, sorry, typo) [11:19:00] nothing suspicious in logstash (but two occurrences of T228041 since I used shell.php) [11:19:01] T228041: Using shell.php in production sends warnings to Logstash - https://phabricator.wikimedia.org/T228041 [11:19:05] syncing [11:21:15] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:23] !log jbond@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:542081|Set dataBridgeEnabled repo setting on beta (T235033)]] (affects InitialiseSettings-labs.php and Wikibase.php, but Wikibase.php part is guarded by isset(), so should be safe to sync both at once, I think) (duration: 01m 00s) [11:21:32] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:34] T235033: Set dataBridgeEnabled repo setting on Beta - https://phabricator.wikimedia.org/T235033 [11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:40] !log jbond@cumin2001 Updating IPMI password on 1253 hosts - jbond@cumin2001 [11:21:42] !log jbond@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:00] jbond42: are you already testing it? [11:22:02] anyone else have something to SWAT? 
[11:22:07] yes just starting now [11:22:07] I didn't upgrade the package yet [11:22:12] did you [11:22:13] ? [11:22:15] yes [11:22:20] ok :D [11:22:50] !log EU SWAT done [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:58] !log installing reportbug updates from stretch point release [11:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:30:33] ^ Amir1 that's another spike of deadlocks [11:32:43] (03PS1) 10Muehlenhoff: Revert renaming of update_type in YAML spec generation [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542085 [11:33:42] PROBLEM - Host cloudvirt1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:33:59] ah damn, looks like I made a typo in my config change :( [11:34:01] only affects Beta though [11:34:58] ACKNOWLEDGEMENT - Host cloudvirt1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Arturo Borrero Gonzalez PDU operations [11:35:16] PROBLEM - Host snapshot1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:42] PROBLEM - Host db1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:44] PROBLEM - Host db1077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:44] PROBLEM - Host db1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:47] !log Stop replication on db2077 to change triggers on db2095:3317 - T234704 [11:35:50] PROBLEM - Host dbproxy1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:51] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 [11:35:54] PROBLEM - 
Host db1118.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:54] PROBLEM - Host db1112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:35:56] !log icinga downtime cloudvirt1026 for 2h (T227536) [11:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:00] T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 [11:36:01] (03PS1) 10Lucas Werkmeister (WMDE): Fix typo in beta repo data bridge config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542087 (https://phabricator.wikimedia.org/T235033) [11:36:14] PROBLEM - Host db1124.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:36:14] PROBLEM - Host dbstore1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:36:14] PROBLEM - Host es1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:36:14] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:21] !log jbond@cumin2001 Updating IPMI password on 1253 hosts - jbond@cumin2001 [11:36:22] !log jbond@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:24] PROBLEM - Host db1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] !log icinga downtime cloudvirt1025 for 2h (T227536) [11:36:34] PROBLEM - Host logstash1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:12] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:37:16] !log icinga downtime cloudvirt1023 for 2h (T227536) [11:37:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:30] PROBLEM - Host an-coord1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:37:40] PROBLEM - Host kafka-jumbo1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:37:46] PROBLEM - Host es1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:38:02] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:11] !log jbond@cumin2001 Updating IPMI password on 1253 hosts - jbond@cumin2001 [11:38:11] !log jbond@cumin2001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix typo in beta repo data bridge config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542087 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:38:58] (03Merged) 10jenkins-bot: Fix typo in beta repo data bridge config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542087 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:40:46] RECOVERY - Host kafka-jumbo1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.25 ms [11:40:46] RECOVERY - Host cloudvirt1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [11:40:52] RECOVERY - Host es1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [11:40:53] !log Deploy schema change on s7 codfw master (db2118), this will generate lag on s7 codfw - T234066 T233135 [11:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:58] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [11:40:58] RECOVERY - Host snapshot1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms 
[11:40:58] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [11:41:03] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:542087|Fix typo in beta repo data bridge config (T235033)]] (duration: 00m 59s) [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] T235033: Set dataBridgeEnabled repo setting on Beta - https://phabricator.wikimedia.org/T235033 [11:41:24] RECOVERY - Host db1083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [11:41:26] RECOVERY - Host db1084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [11:41:26] RECOVERY - Host db1077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [11:41:34] RECOVERY - Host dbproxy1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.05 ms [11:41:36] RECOVERY - Host db1112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [11:41:36] RECOVERY - Host db1118.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [11:41:58] RECOVERY - Host db1124.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [11:41:58] RECOVERY - Host dbstore1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [11:42:06] RECOVERY - Host db1076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [11:42:16] RECOVERY - Host logstash1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [11:42:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not sure this is really worth it. I 'd prefer that we address the problem instead. 
Making it alert less could have the adverse effect of b" [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli) [11:43:12] RECOVERY - Host an-coord1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [11:43:28] RECOVERY - Host es1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [11:46:00] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:11] !log jbond@cumin2001 Updating IPMI password on 35 hosts - jbond@cumin2001 [11:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:17] (03PS1) 10Arturo Borrero Gonzalez: redis: introduce config file for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/542088 [11:46:39] (03CR) 10Lucas Werkmeister (WMDE): Set dataBridgeEnabled repo setting on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542081 (https://phabricator.wikimedia.org/T235033) (owner: 10Lucas Werkmeister (WMDE)) [11:47:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] redis: introduce config file for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/542088 (owner: 10Arturo Borrero Gonzalez) [11:48:37] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [11:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:13] 04Critical Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [11:52:38] (03CR) 10Muehlenhoff: "trusty is gone and the remaining jessie, stretch configs are identical, you could also simply fold the content into redis-common now." 
[puppet] - 10https://gerrit.wikimedia.org/r/542088 (owner: 10Arturo Borrero Gonzalez) [11:53:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:56:22] PROBLEM - Host wdqs1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:26] PROBLEM - Host authdns1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:34] PROBLEM - Host dbstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:34] PROBLEM - Host snapshot1008 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:36] PROBLEM - Host an-coord1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:36] PROBLEM - Host kafka-jumbo1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:57:09] ouch [11:57:12] PROBLEM - Host logstash1011 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:16] :-/ [11:57:20] (03PS1) 10Jbond: ipmi: The change to subprocess.run() failed to capture stdout [software/spicerack] - 10https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) [11:57:26] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:38] !log jbond@cumin2001 START - Cookbook sre.hosts.ipmi-password-reset [11:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:45] !log jbond@cumin2001 Updating IPMI password on 1253 hosts - jbond@cumin2001 [11:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:52] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [11:57:52] PROBLEM - Host ns0-v4 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:58] PROBLEM - Host dbproxy1014 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:22] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:00:00] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:00:18] RECOVERY - Host an-coord1001 is UP: PING WARNING - Packet loss = 86%, RTA = 0.24 ms [12:00:22] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:00:22] RECOVERY - Host logstash1011 is UP: PING WARNING - Packet loss = 50%, RTA = 0.23 ms [12:00:24] RECOVERY - Host wdqs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [12:00:24] RECOVERY - Host snapshot1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [12:00:24] RECOVERY - Host dbstore1004 is 
UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:00:26] RECOVERY - Host kafka-jumbo1003 is UP: PING OK - Packet loss = 0%, RTA = 6.12 ms [12:00:28] RECOVERY - Host dbproxy1014 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:00:38] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [12:00:38] RECOVERY - Host authdns1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [12:00:44] RECOVERY - Host ns0-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:01:35] (03CR) 10jerkins-bot: [V: 04-1] ipmi: The change to subprocess.run() failed to capture stdout [software/spicerack] - 10https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [12:02:13] (03PS1) 10Elukey: kerberos: test kadmin failover/swap [puppet] - 10https://gerrit.wikimedia.org/r/542092 (https://phabricator.wikimedia.org/T226089) [12:02:49] (03CR) 10Jbond: "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542085 (owner: 10Muehlenhoff) [12:03:29] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Revert renaming of update_type in YAML spec generation [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542085 (owner: 10Muehlenhoff) [12:03:33] 04Critical Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Emergency syslog message [12:03:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:03:50] (03PS7) 10Arturo Borrero Gonzalez: toolforge: refactor proxy role from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [12:04:04] (03CR) 10Jbond: [C: 03+2] puppet.eqiad.wmnet: move to codfw [dns] - 10https://gerrit.wikimedia.org/r/542078 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [12:04:16] 
PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:05:52] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:07:33] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [12:11:03] (03PS9) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [12:13:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:17:12] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [12:20:18] (03PS2) 10Phamhi: tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) [12:21:31] marostegui: sorry, I was at meetings. Is it still going on? 
[12:21:40] (03PS3) 10Phamhi: tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) [12:21:47] Amir1: Yes [12:22:00] Amir1: https://logstash.wikimedia.org/goto/353ae696b401fe30c1eda5ec575021aa [12:23:33] sh*t [12:23:42] let me talk to my teammate [12:24:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:28:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:28:28] b1-eqiad pdu refresh physical swap is finished everything is back up with redundant power.. 
will just be labeling [12:30:33] (03CR) 10Elukey: [C: 03+2] kerberos: test kadmin failover/swap [puppet] - 10https://gerrit.wikimedia.org/r/542092 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [12:34:04] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542094 [12:36:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542094 (owner: 10Marostegui) [12:37:01] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542094 (owner: 10Marostegui) [12:40:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:41:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool es1013, es1014 after PDU maintenance (duration: 00m 59s) [12:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:23] (03PS1) 10Elukey: Revert "kerberos: test kadmin failover/swap" [puppet] - 10https://gerrit.wikimedia.org/r/542095 [12:43:53] (03CR) 10Elukey: [C: 03+2] Revert "kerberos: test kadmin failover/swap" [puppet] - 10https://gerrit.wikimedia.org/r/542095 (owner: 10Elukey) [12:49:04] (03CR) 10Mvolz: "This is just to enable ref tabs, we'd need to set $wgWBCitoidFullRestbaseURL = 'http://en.wikipedia.org/api/rest_'; as well to enable cito" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [12:49:17] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10Jclark-ctr) Swap is finished everything is back up 
with redundant power.. updated netbox with for old and new pdu. [12:51:28] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10Jclark-ctr) [12:51:47] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10Jclark-ctr) a:05Jclark-ctr→03RobH [12:53:28] 10Operations, 10Core Platform Team, 10Editing-team, 10Fundraising-Backlog, and 11 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10phuedx) [12:56:36] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:59:41] ^ me [12:59:50] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:00:15] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [13:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:32] (03PS1) 10Marostegui: db-eqiad.php: More traffic to es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542096 [13:01:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:02:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542096 (owner: 10Marostegui) [13:03:21] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to es1013,es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542096 (owner: 10Marostegui) [13:03:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10MoritzMuehlenhoff) >>! In T226089#5559672, @MoritzMuehlenhoff wrote: >>>! In T226089#5559492, @elukey wrote: >>>>! In T226089#... [13:05:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to es1013, es1014 after PDU maintenance (duration: 00m 58s) [13:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:56] PROBLEM - Check systemd state on krb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:08] checking --^ [13:15:35] (03PS2) 10Ottomata: Bumping up refine to newest version [puppet] - 10https://gerrit.wikimedia.org/r/541929 (https://phabricator.wikimedia.org/T234461) (owner: 10Nuria) [13:16:47] !log added flannel 0.5.5-4 to buster-wikimedia (T235059) [13:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:53] T235059: Toolforge: refresh puppet code for proxy (dynamicproxy) to support Debian Buster - https://phabricator.wikimedia.org/T235059 [13:17:25] (03CR) 10Ottomata: [C: 03+2] Bumping up refine to newest version [puppet] - 10https://gerrit.wikimedia.org/r/541929 (https://phabricator.wikimedia.org/T234461) (owner: 10Nuria) [13:25:05] (03PS1) 10Elukey: kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) [13:25:07] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012" [puppet] - 10https://gerrit.wikimedia.org/r/542103 [13:25:15] (03PS2) 10Marostegui: Revert "dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012" [puppet] - 10https://gerrit.wikimedia.org/r/542103 [13:25:43] (03PS3) 10Marostegui: Revert "dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012" [puppet] - 10https://gerrit.wikimedia.org/r/542103 [13:26:42] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1011 to reclone it from labsdb1012" [puppet] - 10https://gerrit.wikimedia.org/r/542103 (owner: 10Marostegui) [13:27:25] !log Repool labsdb1011 after reclone - T235016 [13:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:29] T235016: Reclone labsdb1011 - https://phabricator.wikimedia.org/T235016 [13:30:12] o/ herron are you on clinic duty this week? 
[13:30:14] (03CR) 10Elukey: [C: 03+2] kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:30:21] if so, could you take care of https://phabricator.wikimedia.org/T234473 ? [13:30:21] (03PS2) 10Elukey: kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) [13:30:23] (03CR) 10Muehlenhoff: kerberos: ensure resources that might change during failover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:31:15] (03CR) 10Elukey: [C: 03+2] kerberos: ensure resources that might change during failover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:31:26] (03CR) 10Elukey: kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:31:40] ottomata: hey, no, stale topic here. but sure will take a look anyway [13:32:05] !log reimage puppetmaster1001 [13:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:26] (03PS3) 10Elukey: kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) [13:33:46] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019... [13:33:52] herron: thanks, or ping whoever is on clinic duty [13:33:53] ty! 
[13:34:32] (03PS4) 10Elukey: kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) [13:36:00] (03CR) 10Elukey: [C: 03+2] kerberos: ensure resources that might change during failover [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:36:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542102 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:36:45] (03PS1) 10Jbond: puppetmaster1001: updateboot image to buster [puppet] - 10https://gerrit.wikimedia.org/r/542105 (https://phabricator.wikimedia.org/T234315) [13:37:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542105 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:39:04] jbond42: should we use 1002 to puppet-merge? [13:39:19] no probably 2001 [13:39:43] 2001 [13:39:59] although i think it should work from any of them [13:40:15] (03PS2) 10Jbond: puppetmaster1001: updateboot image to buster [puppet] - 10https://gerrit.wikimedia.org/r/542105 (https://phabricator.wikimedia.org/T234315) [13:41:17] ack! 
[13:41:33] (03CR) 10Jbond: [C: 03+2] puppetmaster1001: updateboot image to buster [puppet] - 10https://gerrit.wikimedia.org/r/542105 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:44:30] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.02384 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:44:53] looking ^^ [13:45:46] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:47:23] (03Abandoned) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [13:48:32] RECOVERY - Check systemd state on krb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:29] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019... [13:49:36] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['puppetmaster1001.eqiad.wmnet'] ` [13:50:25] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019... 
[13:50:36] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['puppetmaster1001.eqiad.wmnet'] ` [13:50:45] !log disable puppet fleet wide as puppetmaster2002 is struggling [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10elukey) >>! In T226089#5562842, @MoritzMuehlenhoff wrote: >>>! In T226089#5559672, @MoritzMuehlenhoff wrote: >>>>! In T226089#... [13:53:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10MoritzMuehlenhoff) >>! In T226089#5562958, @elukey wrote: >>>! In T226089#5562842, @MoritzMuehlenhoff wrote: >>>>! In T226089#... 
[13:54:46] (03PS1) 10Jbond: puppetmaster: update ca to puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/542108 (https://phabricator.wikimedia.org/T234315) [13:55:29] (03CR) 10Jbond: [C: 03+2] puppetmaster: update ca to puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/542108 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:56:29] 10Operations, 10ops-codfw, 10Cloud-Services: Build, package bdsync for Buster - https://phabricator.wikimedia.org/T234683 (10Andrew) p:05Triage→03Normal a:05Andrew→03aborrero [13:57:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 db1083 db1076 db1118 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9299 and previous config saved to /var/cache/conftool/dbconfig/20191010-135659-marostegui.json [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:25] (03PS2) 10Andrew Bogott: remove references to no-longer-in-use labpuppetmaster1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/541829 (https://phabricator.wikimedia.org/T234462) [13:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9300 and previous config saved to /var/cache/conftool/dbconfig/20191010-135806-marostegui.json [13:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:29] (03CR) 10Andrew Bogott: [C: 03+2] remove references to no-longer-in-use labpuppetmaster1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/541829 (https://phabricator.wikimedia.org/T234462) (owner: 10Andrew Bogott) [14:00:42] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019... 
[14:01:14] (03PS1) 10Marostegui: db-eqiad.php: Fully repool es1013, es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542109 [14:02:15] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool es1013, es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542109 (owner: 10Marostegui) [14:02:56] (03PS2) 10Andrew Bogott: beta: dont include scap::scripts twice [puppet] - 10https://gerrit.wikimedia.org/r/512859 (owner: 10Alex Monk) [14:03:05] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool es1013, es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542109 (owner: 10Marostegui) [14:03:55] !log re-enable puppet now ca has been correctly moved [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool es1013, es1014 after PDU maintenance (duration: 00m 59s) [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:19] (03CR) 10Andrew Bogott: [C: 03+2] beta: dont include scap::scripts twice [puppet] - 10https://gerrit.wikimedia.org/r/512859 (owner: 10Alex Monk) [14:06:22] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:10:48] (03PS1) 10Elukey: kerberos: test failover (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/542112 (https://phabricator.wikimedia.org/T226089) [14:11:23] (03CR) 10Elukey: [C: 03+2] kerberos: test failover (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/542112 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [14:12:53] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['puppetmaster1001.eqiad.wmnet'] ` [14:13:04] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 db1083 db1076 db1112 db1118 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9301 and previous config saved to /var/cache/conftool/dbconfig/20191010-141303-marostegui.json [14:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:26] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:14:02] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:14:33] looking [14:15:00] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 7.651 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:16:27] (03PS2) 10Ottomata: Ensure eventlogging-consumer mysql is absent on eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/541359 (https://phabricator.wikimedia.org/T223414) [14:16:33] (03CR) 10Elukey: [C: 03+1] Ensure eventlogging-consumer mysql is absent on eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/541359 (https://phabricator.wikimedia.org/T223414) (owner: 10Ottomata) [14:19:10] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18829/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/541359 (https://phabricator.wikimedia.org/T223414) (owner: 10Ottomata) [14:19:56] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.0183 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:21:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Papaul) ` [edit interfaces interface-range disabled] 
member ge-6/0/6 { ... } + member ge-6/0/5; [edit interfaces] - ge-6/0/5 { - description db2057; - en... [14:22:02] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Papaul) [14:23:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1084 db1083 db1076 db1112 db1118 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9302 and previous config saved to /var/cache/conftool/dbconfig/20191010-142323-marostegui.json [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:34] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:25:05] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/11; [edit interfaces interface-range disabled] me... 
[14:25:45] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (10Papaul) [14:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool to db1084 db1083 db1076 db1112 db1118 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9303 and previous config saved to /var/cache/conftool/dbconfig/20191010-143633-marostegui.json [14:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1112 into recentchanges and remove db1078 from it after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9304 and previous config saved to /var/cache/conftool/dbconfig/20191010-143924-marostegui.json [14:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:14] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Jclark-ctr) [14:40:16] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Jclark-ctr) 05Open→03Resolved updated netbox with new pdu`s [14:42:02] (03PS1) 10Muehlenhoff: Move parsing of Cumin alias/query outside of a global option [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542125 [14:42:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1074 after BBU replacement T231638', diff saved to https://phabricator.wikimedia.org/P9305 and previous config saved to /var/cache/conftool/dbconfig/20191010-144201-marostegui.json [14:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:06] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 [14:43:13] 10Operations, 10DBA, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - 
https://phabricator.wikimedia.org/T234800 (10Xaosflux) [14:44:11] (03PS1) 10Elukey: Revert "kerberos: test failover (part 2)" [puppet] - 10https://gerrit.wikimedia.org/r/542126 [14:44:16] (03PS2) 10Muehlenhoff: Move parsing of Cumin alias/query outside of a global option [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542125 [14:44:18] (03PS2) 10Elukey: Revert "kerberos: test failover (part 2)" [puppet] - 10https://gerrit.wikimedia.org/r/542126 [14:45:20] (03CR) 10Elukey: [C: 03+2] Revert "kerberos: test failover (part 2)" [puppet] - 10https://gerrit.wikimedia.org/r/542126 (owner: 10Elukey) [14:47:58] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:50:14] (03CR) 10Dzahn: "ACK, i noticed this after merge too but did not get to follow-up. will do." [puppet] - 10https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [14:51:46] 10Operations, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (10Dzahn) Could confirm yesterday i can login again with the hotfix. Thanks! [14:52:34] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10jrobell) Hi all, Any news on this task? We'd highly appreciate your support to get Erin and Jerrie access as soon as possible as there are cert... [14:52:36] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:30] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01011 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:54:47] !log ran systemctl reset-failed on puppetmaster1001 (puppet-master.service after reimage) [14:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:54] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:42] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 9.159 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:56:10] (03CR) 10Brennen Bearnes: [C: 03+2] Update mediawiki-dev chart README [deployment-charts] - 10https://gerrit.wikimedia.org/r/542011 (https://phabricator.wikimedia.org/T222494) (owner: 10Jeena Huneidi) [14:57:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1074 after getting its BBU replaced T231638', diff saved to https://phabricator.wikimedia.org/P9306 and previous config saved to /var/cache/conftool/dbconfig/20191010-145737-marostegui.json [14:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:42] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 [14:58:03] (03PS1) 10Muehlenhoff: Update late.sh for hosts no longer using puppet 4 client packages [puppet] - 10https://gerrit.wikimedia.org/r/542131 [14:58:24] (03PS1) 10Herron: admin: add dedcode to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542132 (https://phabricator.wikimedia.org/T234473) [14:59:34] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:00:12] (03CR) 10Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [15:00:50] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:01:42] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10herron) Hello, I've uploaded a patch set for this access. Typically yes @nuria approves additions to analytics groups. Once that'... [15:01:49] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10herron) [15:05:04] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 3.970 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:09:36] (03PS1) 10Elukey: kerberos: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542133 (https://phabricator.wikimedia.org/T226089) [15:10:15] (03CR) 10Elukey: [C: 03+2] kerberos: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542133 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [15:11:20] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:14:11] 10Puppet: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) [15:15:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10elukey) Tested the failover and 
improved the puppet code to do proper clean ups when failing back to the original state. Tested a change in password... [15:16:09] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) @EYener in the task description it looks like the ssh key fingerprint was provided, instead of the ssh public key itself. Could you pl... [15:17:22] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10Nuria) Approved on my end. [15:18:00] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1adf74e]: Update mobileapps to c89aa55 [15:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:40] (03PS1) 10Jbond: puppet.eqsin.wmnet: move back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) [15:19:42] (03PS1) 10Jbond: puppet.eqiad.wmnet: moveback to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) [15:19:56] (03CR) 10jerkins-bot: [V: 04-1] puppet.eqsin.wmnet: move back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:20:14] (03CR) 10jerkins-bot: [V: 04-1] puppet.eqiad.wmnet: moveback to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:20:20] (03CR) 10Muehlenhoff: puppet.eqsin.wmnet: move back to eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:22:50] (03PS2) 10Jbond: puppet.eqsin.wmnet: move back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) [15:23:02] (03CR) 10Jbond: "thanks" (031 comment) [dns] - 
10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:23:10] (03PS4) 10Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [15:23:17] (03PS2) 10Jbond: puppet.eqiad.wmnet: moveback to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) [15:23:39] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1adf74e]: Update mobileapps to c89aa55 (duration: 05m 39s) [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:03] (03PS1) 10Ayounsi: Move users to YAML common instead of templates [homer/public] - 10https://gerrit.wikimedia.org/r/542139 [15:24:17] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 3.929 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:24:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:25:05] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) 05Open→03Resolved [15:25:10] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [15:25:14] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) ta-tachannnn!!!! 
[15:25:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:25:17] (03CR) 10Jbond: [C: 03+2] puppet.eqsin.wmnet: move back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542137 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:26:34] (03PS1) 10Herron: admin: add jkumalah to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542141 (https://phabricator.wikimedia.org/T234433) [15:27:39] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:28:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10herron) Hi @Nuria could you please review this for approval? [15:29:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10Nuria) Approved on my end [15:29:17] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [15:30:52] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) @Jclark-ctr and I went through the following to fix this issue: * tested (failed) scs-a8-eqiad port 2 to ps1-a2-eqiad connection * tested (works) scs-a8-eqiad:3 to ps1-a3-e... 
[15:32:13] PROBLEM - ps1-a2-eqiad-infeed-load-tower-B-phase-Z on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:32:25] PROBLEM - ps1-a2-eqiad-infeed-load-tower-B-phase-Y on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:32:25] PROBLEM - ps1-a2-eqiad-infeed-load-tower-B-phase-X on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:32:47] PROBLEM - ps1-a2-eqiad-infeed-load-tower-A-phase-Y on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:37:15] (03PS1) 10Jakob: Update termbox test to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542151 [15:38:34] (03CR) 10Jakob: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/542151 (owner: 10Jakob) [15:39:58] (03CR) 10Ladsgroup: [C: 03+2] Update termbox test to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542151 (owner: 10Jakob) [15:40:08] (03PS1) 10CDanis: vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 [15:40:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Update termbox test to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542151 (owner: 10Jakob) [15:41:00] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [15:41:02] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) 05Open→03Resolved Please note that with the temp serial run, we went ahead and setup ps1-a2-eqiad. The existing serial needs to be fixed though. [15:41:43] PROBLEM - ps1-a2-eqiad-infeed-load-tower-A-phase-X on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:42:03] 10Operations, 10ops-eqiad, 10DC-Ops: fix serial connection for ps1-a2-eqiad - https://phabricator.wikimedia.org/T235190 (10RobH) [15:42:12] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . 
[15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] 10Operations, 10ops-eqiad, 10DC-Ops: fix serial connection for ps1-a2-eqiad - https://phabricator.wikimedia.org/T235190 (10RobH) p:05Triage→03Normal [15:43:02] (03PS1) 10Jbond: puppet.wikimedia.org.: move back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542154 (https://phabricator.wikimedia.org/T234315) [15:43:10] 04Critical Alert for device ps1-a3-eqiad.mgmt.eqiad.wmnet - Device rebooted [15:45:20] (03PS1) 10Jakob: Update termbox ssr staging to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542155 [15:45:33] PROBLEM - ps1-a2-eqiad-infeed-load-tower-A-phase-Z on ps1-a2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:36] (03PS1) 10Jakob: Update termbox ssr codfw to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542156 [15:46:38] (03PS1) 10Jakob: Update termbox ssr eqiqad to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542157 [15:47:07] (03PS2) 10Jakob: Update termbox ssr eqiad to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542157 [15:47:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10herron) @kevinbazira could you please review and sign the L3 Acknowledgement of Wikimedia Server Access Responsi... 
[15:48:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10herron) [15:48:20] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10herron) [15:48:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Papaul) [15:49:29] PROBLEM - Kerberos KAdmin daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [15:50:43] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (10Papaul) [15:51:30] (03CR) 10Jbond: [C: 03+2] puppet.wikimedia.org.: move back to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542154 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:53:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-a3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [15:53:33] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10herron) [15:58:10] 04Critical Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - Device rebooted [15:58:53] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10herron) Hello, @CorinnaHillebrand_WMDE, could you please review and sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document, update the task descr... 
[15:59:49] (03CR) 10Jbond: [C: 03+2] puppet.eqiad.wmnet: moveback to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:59:57] (03PS3) 10Jbond: puppet.eqiad.wmnet: moveback to puppetmaster1001 [dns] - 10https://gerrit.wikimedia.org/r/542138 (https://phabricator.wikimedia.org/T234315) [16:00:05] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1600). Please do the needful. [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:55] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10herron) Hello, could you please expand on this request? What resources are meant to be accessed, and do you know specifically what LDAP group? Thanks in advance! [16:01:10] !log Upgrading sessionstore1001.eqiad.wmnet to Cassandra 3.11.4 -- T200803 [16:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:22] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [16:03:05] (03CR) 10Jakob: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/542155 (owner: 10Jakob) [16:03:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-a2-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [16:03:30] (03CR) 10Jakob: [C: 03+2] Update termbox ssr staging to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542155 (owner: 10Jakob) [16:03:43] (03PS1) 10Elukey: profile::kerberos::kadminserver: fix typo in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542159 (https://phabricator.wikimedia.org/T226089) [16:03:45] (03CR) 10Jakob: [V: 03+2 C: 03+2] Update termbox ssr staging to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542155 (owner: 10Jakob) [16:03:47] (03Merged) 10jenkins-bot: Update termbox ssr staging to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/542155 (owner: 10Jakob) [16:04:17] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [16:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:31] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: fix typo in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542159 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [16:04:38] !log restarting gerrit due to T224448 [16:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:42] T224448: Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely - https://phabricator.wikimedia.org/T224448 [16:06:20] (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/542156 (owner: 10Jakob) [16:07:25] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[16:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:33] (03CR) 10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/542159 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [16:08:40] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10nnikkhoui) AH apologies! Should be the "wmf" LDAP group [16:10:44] (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/542157 (owner: 10Jakob) [16:11:22] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . [16:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:47] !log Upgrading sessionstore1002.eqiad.wmnet to Cassandra 3.11.4 -- T200803 [16:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:54] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [16:18:09] !log Upgrading sessionstore1003.eqiad.wmnet to Cassandra 3.11.4 -- T200803 [16:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:28] !log Upgrading sessionstore200[1-3].codfw.wmnet to Cassandra 3.11.4 -- T200803 [16:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:13] jouncebot: next [16:23:13] In 0 hour(s) and 36 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1700) [16:23:39] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:24:22] marxarelli: andrewbogott and I have things ready for wikitech to get back on the train. 
There is one backport that needs to land to make it all happen smoothly -- https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/OpenStackManager/+/542168/ [16:25:00] marxarelli: can that just go into the train today or do you want me to SWAT it? [16:25:40] (03PS1) 10Ottomata: Re-enable eventlogging-mysql-consumer until we are more ready [puppet] - 10https://gerrit.wikimedia.org/r/542169 (https://phabricator.wikimedia.org/T223414) [16:25:41] bd808: I can roll it into the train [16:25:58] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Re-enable eventlogging-mysql-consumer until we are more ready [puppet] - 10https://gerrit.wikimedia.org/r/542169 (https://phabricator.wikimedia.org/T223414) (owner: 10Ottomata) [16:26:06] marxarelli: beverage of your choice at next IRL meeting :) [16:26:22] bd808: I might do labswiki on its own before all wikis [16:26:42] nice :) [16:26:47] *nod* that seems like a nice safety step [16:29:15] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) [16:34:43] RECOVERY - Kerberos KAdmin daemon on krb1001 is OK: PROCS OK: 1 process with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [16:39:07] (03PS2) 10Cwhite: Parameterize path so as to better integrate with Prometheus service discovery. Parameterize spec_segment. Maintain backwards compatibility. Improvements preparing and sanitizing the url for sending to CheckService. Add documentation Update requirements.txt to better deal with gevent pip install. (It doesn't like to compile that old version on my machine.) 
[debs/prometheus-swagger-exporter] - 10https://ger [16:42:50] (03PS1) 10Andrew Bogott: nova-fullstack tests: add a test for dns cleanup post-delete [puppet] - 10https://gerrit.wikimedia.org/r/542170 (https://phabricator.wikimedia.org/T235129) [16:44:53] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:45:13] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 2.904 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:46:43] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 2.727 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:50:55] 10Operations, 10DBA, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Johan) This only affects English Wikipedia, right? [16:52:01] 10Operations, 10DBA, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui) Yep! 
[16:54:07] (03CR) 10Andrew Bogott: [C: 03+1] openstack: drop jessie code [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [16:55:27] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002926 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:55:55] whoop whoop finnaly :( [16:56:04] :) [16:57:03] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001463 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:58:16] 10Operations, 10MediaWiki-extensions-OATHAuth, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10WDoranWMF) [16:58:29] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10Aklapper) Is this about environmental sustainability, or some other sustainability? Should probably have a good description... [16:59:12] (03PS1) 10Paladox: Gerrit: Up the "accounts" cache to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/542174 [17:00:04] cscott, arlolra, subbu, halfak, and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1700). [17:00:21] (03PS2) 10Paladox: Gerrit: Up the "accounts" cache to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/542174 (https://phabricator.wikimedia.org/T224448) [17:01:33] (03CR) 10Jhedden: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/542170 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [17:06:25] (03Abandoned) 10Paladox: Gerrit: Up the "accounts" cache to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/542174 (https://phabricator.wikimedia.org/T224448) (owner: 10Paladox) [17:07:00] !log puppetmaster1001 has been upgraded and is back serving requests [17:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:04] (03PS2) 10Alexandros Kosiaris: restbase: Cassandra client access from k8s [puppet] - 10https://gerrit.wikimedia.org/r/541911 (https://phabricator.wikimedia.org/T234374) (owner: 10Eevans) [17:08:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] restbase: Cassandra client access from k8s [puppet] - 10https://gerrit.wikimedia.org/r/541911 (https://phabricator.wikimedia.org/T234374) (owner: 10Eevans) [17:08:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 57.64 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:11:25] this seems to be due to a traffic spike in eqsin some minutes ago ^ [17:11:28] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&panelId=2&fullscreen&var-site=eqsin&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=1570722443946&to=1570727438635 [17:11:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 81.81 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:11:52] nothing particularly worrisome [17:19:50] (03PS1) 10Jforrester: tests: Declare strict types for the static test now HHVM is gone [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/542178 (https://phabricator.wikimedia.org/T235142) [17:19:52] (03PS1) 10Jforrester: Wikibase: Don't check to shard wmgWBSharedCacheKey for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542179 (https://phabricator.wikimedia.org/T235142) [17:19:55] (03PS1) 10Jforrester: Don't check to shard static config cache for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542180 (https://phabricator.wikimedia.org/T235142) [17:19:57] (03PS1) 10Jforrester: Drop HHVM special-case for SVG converter, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542181 (https://phabricator.wikimedia.org/T235142) [17:19:59] (03PS1) 10Jforrester: Drop special-case for PHP7 in PHPAutoPrepend, now always used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542182 (https://phabricator.wikimedia.org/T235142) [17:20:01] (03PS1) 10Jforrester: Drop nutcracker indirection for HHVM servers, just point to localhost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542183 (https://phabricator.wikimedia.org/T235142) [17:20:03] (03PS1) 10Jforrester: Drop HHVMRequestInit, never called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) [17:20:05] (03PS1) 10Jforrester: Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) [17:31:58] 10Operations, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Patch-For-Review, 10User-Ladsgroup, and 2 others: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) @jhsoby, is this finished yet? I'm just looking to see if (a) I should clear the task identifier from the "closed"... 
[17:35:55] (03CR) 10Jhedden: [C: 04-1] "labstore1004 and labstore1005 are both using `openstack::clientpackages::mitaka::jessie` unfortunately" [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [17:40:43] (03CR) 10Jforrester: [C: 03+2] tests: Declare strict types for the static test now HHVM is gone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542178 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [17:41:32] (03Merged) 10jenkins-bot: tests: Declare strict types for the static test now HHVM is gone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542178 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [17:49:32] (03PS2) 10Andrew Bogott: nova-fullstack tests: add a test for dns cleanup post-delete [puppet] - 10https://gerrit.wikimedia.org/r/542170 (https://phabricator.wikimedia.org/T235129) [17:58:06] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack tests: add a test for dns cleanup post-delete [puppet] - 10https://gerrit.wikimedia.org/r/542170 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [17:58:08] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Disable auto reloading replication config [puppet] - 10https://gerrit.wikimedia.org/r/541115 (owner: 10Paladox) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:33] PROBLEM - Host ps1-b1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:03:00] /ac/ac [18:13:23] ACKNOWLEDGEMENT - SSH mw1290.mgmt on mw1290.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T234153 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:25] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) @Jclark-ctr checked on this. (Thanks!) but this still needs to happen. One minute i could SSH to it just fine and 12 minutes later it was alerting in Icinga again. So it keeps being "from time to time" an... [18:23:02] (03PS1) 10Dzahn: phabricator: install s-nail instead of heirloom-mailx on any distro [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) [18:23:56] (03CR) 10Dzahn: "well..i also did not want to have to change the mail command line used by the scripts. https://gerrit.wikimedia.org/r/542191" [puppet] - 10https://gerrit.wikimedia.org/r/541967 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [18:26:52] (03CR) 10Dzahn: "yes, i see "It is intended to provide" [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [18:39:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) a:05Cmjohnson→03Jclark-ctr So the labstore1003-array[123] are all causing report erros on https://netbox.wikimedia.org/extras/rep... 
[18:40:42] (03PS1) 10Paladox: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 [18:49:31] (03PS2) 10Paladox: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 [18:52:20] (03PS3) 10Paladox: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 [18:53:45] (03PS4) 10Paladox: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 [18:54:01] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [18:55:13] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10bd808) a:03bd808 [18:55:19] (03PS5) 10Paladox: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 [19:00:05] marxarelli: How many deployers does it take to do MediaWiki train - American version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T1900). 
[19:01:33] (03PS1) 10Reedy: Remove debug line from wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542195 [19:02:38] (03PS1) 10Dduvall: Revert "Rollback labswiki to 1.34.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542196 [19:02:40] (03CR) 10Dduvall: [C: 03+2] Revert "Rollback labswiki to 1.34.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542196 (owner: 10Dduvall) [19:03:30] (03Merged) 10jenkins-bot: Revert "Rollback labswiki to 1.34.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542196 (owner: 10Dduvall) [19:04:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:56] !log promoting labswiki to 1.35.0-wmf.1 cc: T233849 [19:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:59] T233849: 1.35.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T233849 [19:05:00] bd808: ^ [19:05:11] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:08:15] Reedy: Ouch. :-) [19:09:04] (03CR) 10Ottomata: [C: 03+2] Release Spark 2.4.3 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/532455 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [19:09:37] !log dduvall@deploy1001 Synchronized php-1.35.0-wmf.1/extensions/OpenStackManager: labswiki to 1.35.0-wmf.1 (duration: 01m 00s) [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:14] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: labswiki to 1.35.0-wmf.1 [19:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:40] marxarelli: sweet! 
I'll do some testing of the patch you backported for us [19:12:27] +1 [19:13:56] no labswiki errors in logstash thus far. i'll go ahead with group1 [19:14:33] er, all wikis [19:14:43] it's thursday, marxarelli [19:14:48] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10bd808) 05Open→03Resolved `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database nqowiki --debug $... [19:15:33] (03PS1) 10Andrew Bogott: cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) [19:15:35] (03PS1) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:15:39] marxarelli: +1 Basic wiki stuff looks fine on wikitech [19:15:58] awesome. thanks for testing! [19:16:50] (03PS1) 10Dduvall: all wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542203 [19:16:52] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542203 (owner: 10Dduvall) [19:17:05] 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces ge-5/0/9] - description kafka2001; + description logstash2000; ` ` papaul@asw-b-codfw# show | compare [edit interfaces... 
[19:17:47] (03CR) 10jerkins-bot: [V: 04-1] cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:18:04] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542203 (owner: 10Dduvall) [19:18:09] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:19:51] 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10RobH) Please note that the netbox mis-match caused netbox reporting errors. To fix this I have done the following: * renamed kafka1001 to logstash1020 in netbox, switch port descriptio... [19:20:17] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.1 [19:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:31] (03PS2) 10Andrew Bogott: cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) [19:22:33] (03PS2) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:24:50] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:25:13] (03PS1) 10CDanis: secret: varnish: dummy acl bot_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/542206 [19:25:28] !log swift codfw-prod: more weight to ms-be205[1-6] - T233638 [19:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:31] T233638: 
rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [19:25:46] 10Operations, 10wikitech.wikimedia.org, 10MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), 10cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (10bd808) 05Open→03Resolved a:03Andrew @Andrew and I managed to... [19:27:47] 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10RobH) Please note that the hostname mismatch in netbox versus puppet was causing reporting errors on the puppetdb netbox report. To fix this, I have done the following: * renamed kafka... [19:27:50] (03PS3) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:28:42] thank you marxarelli — looks good! [19:29:04] +1. our patch works and it's not a live hack anymore :) [19:29:10] (03PS5) 10Paladox: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541110 [19:29:11] !log promoted 1.35.0-wmf.1 to all wikis. no rise in errors rates. no new relevant errors cc: T233849 [19:29:13] (03PS3) 10Paladox: Gerrit: Lower TTL to 300 [dns] - 10https://gerrit.wikimedia.org/r/541393 [19:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:15] T233849: 1.35.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T233849 [19:29:18] (03PS3) 10Paladox: Switch gerrit.wikimedia.org backend to gerrit1001 [dns] - 10https://gerrit.wikimedia.org/r/541111 [19:29:19] andrewbogott, bd808 np! 
thanks for the fix [19:30:08] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:30:24] (03Abandoned) 10Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/540164 (owner: 10Paladox) [19:30:26] (03PS4) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:32:38] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:33:29] !log swift eqiad-prod: add weight to ms-be105[1-6] - T232367 [19:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:33] T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 [19:34:13] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [19:34:35] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [19:37:53] (03PS1) 10Andrew Bogott: Added dummy password for cloud puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/542207 [19:38:03] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added dummy password for cloud puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/542207 (owner: 10Andrew Bogott) [19:41:48] (03PS1) 10Andrew Bogott: Added snakeoil certs for cloud puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/542209 [19:42:35] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added 
snakeoil certs for cloud puppetmasters [labs/private] - 10https://gerrit.wikimedia.org/r/542209 (owner: 10Andrew Bogott) [19:44:06] anyone reported issues with CORS POST requests to the API lately? I seem to get back a help page ignoring my POST data .... I might be doing something else wrong though [19:44:43] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1001/18833/cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [19:45:32] (03PS5) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:46:23] (03PS6) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [19:51:20] nevermind, i figured it out [19:51:42] HHVM (?) was not picky about Content-Type: x-www-form-urlencoded on POSTs, but PHP 7.2 is (?) [19:51:59] i was accidentally leaving it off and defaulting to text/plain which it no longer likes [19:58:00] (03CR) 10Krinkle: [C: 04-1] "LGTM, but would prefer the puppet code be removed first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [19:58:21] (03CR) 10Krinkle: [C: 03+1] Wikibase: Don't check to shard wmgWBSharedCacheKey for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542179 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:03:37] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10EYener) Hi @herron thanks for the ping. Please let me know if this is what you need: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBLrqCmWRtknGqk3RLECT... 
[20:04:27] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) 05Resolved→03Open I should not have resolved this. The existing netbox entry is the old pdu, which needs to be renamed to the asset tag and removed from the rack in net... [20:04:31] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [20:09:16] (03CR) 10Krinkle: [C: 03+1] "Perhaps we could/should include PHP_VERSION instead. This would mean during minor upgrades and other such things we allow wmf-config and/o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542180 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:09:24] (03CR) 10Krinkle: [C: 03+1] Drop HHVM special-case for SVG converter, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542181 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:09:43] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:09:47] (03CR) 10Krinkle: [C: 03+1] Drop special-case for PHP7 in PHPAutoPrepend, now always used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542182 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:10:33] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:11:13] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:11:38] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - 
https://phabricator.wikimedia.org/T234999 (10mepps) Good question @Aklapper! It is environmental sustainability. I was going for brevity, but I'm open to environment-sustainability@lists.wikimedia.org if it makes more sense. [20:11:58] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:12:27] 10Operations, 10cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (10JHedden) After the Newton upgrade the Stretch hosts are all running the distro version: `systemd-sysv 232-25+deb9u11` * cloudcontrol[1003-1004].wikimedia.org * cloudnet[1003-1004... [20:13:13] (03PS3) 10Cwhite: Update probe endpoint to support path and spec_segment [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/541683 [20:13:44] !log Updated the Wikidata property suggester with data from the 2019-09-30 JSON dump and applied the T132839 workarounds [20:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:50] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [20:14:10] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [20:15:11] (03PS2) 10CDanis: vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 [20:15:34] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10fgiunchedi) Thanks for reaching out @jcrespo, happy to help brainstorming on monitoring and which metrics make sense for this use case. Can do either on task or hangout for h... 
[20:16:11] 10Operations, 10Traffic, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10fgiunchedi) +1 on at least a week's worth of data [20:16:51] (03CR) 10CDanis: [C: 03+2] secret: varnish: dummy acl bot_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/542206 (owner: 10CDanis) [20:17:03] (03CR) 10CDanis: [V: 03+2 C: 03+2] secret: varnish: dummy acl bot_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/542206 (owner: 10CDanis) [20:19:10] 10Operations: Onboarding Reuven - https://phabricator.wikimedia.org/T235215 (10Dzahn) [20:21:56] (03PS2) 10Krinkle: Drop nutcracker indirection for HHVM servers, just point to localhost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542183 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:22:39] (03CR) 10Krinkle: [C: 03+1] "It's additional indirection to setup, but in terms of perf it's more direct and was done as optimisation. I've filed T235216 about reconsi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542183 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:24:07] (03CR) 10Krinkle: [C: 03+1] "LGTM, but would like to be around when staging this to test it thoroughly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [20:27:10] (03Abandoned) 10Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi) [20:30:09] RECOVERY - Host ps1-b1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [20:30:37] (03CR) 10Jhedden: [C: 03+1] "nice addition" [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [20:33:47] PROBLEM - ps1-b1-eqiad-infeed-load-tower-B-phase-Z on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:05] PROBLEM - ps1-b1-eqiad-infeed-load-tower-B-phase-X on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:09] PROBLEM - ps1-b1-eqiad-infeed-load-tower-A-phase-Y on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:15] PROBLEM - ps1-b1-eqiad-infeed-load-tower-A-phase-Z on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:43] PROBLEM - ps1-b1-eqiad-infeed-load-tower-B-phase-Y on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:49] PROBLEM - ps1-b1-eqiad-infeed-load-tower-A-phase-X on ps1-b1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:36:20] 
10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10Varnent) That works - thank you @herron. [20:36:35] !log otto@deploy1001 Started deploy [analytics/refinery@9b322e4]: attempting to fix missing git fat jar on stat1004 [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:40] (03PS3) 10CDanis: vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 [20:36:41] !log otto@deploy1001 Finished deploy [analytics/refinery@9b322e4]: attempting to fix missing git fat jar on stat1004 (duration: 00m 06s) [20:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:31] (03PS2) 10Jeena Huneidi: Update mediawiki-dev chart README [deployment-charts] - 10https://gerrit.wikimedia.org/r/542011 (https://phabricator.wikimedia.org/T222494) [20:44:43] (03CR) 10CDanis: "In the vagrant environment on my workstation, I get 30-second timeout failures in text/01-w.wiki-shortener.vtc and text/07-backend-misspas" [puppet] - 10https://gerrit.wikimedia.org/r/542153 (owner: 10CDanis) [20:57:34] (03PS1) 10Ottomata: Spark 2.4.4 release [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/542225 (https://phabricator.wikimedia.org/T222253) [21:03:08] 10Operations, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Patch-For-Review, 10User-Ladsgroup, and 2 others: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10jhsoby) The import is done as far as I can tell. There may be a few remnant pages that were not categorized properly, but I t... 
[21:04:38] (03CR) 10BryanDavis: [C: 03+1] tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) (owner: 10Phamhi) [21:06:40] (03CR) 10Andrew Bogott: "Vm spot-checks show only noops: https://puppet-compiler.wmflabs.org/compiler1001/18846/" [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [21:08:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:09:22] (03CR) 10CDanis: "Rerunning tests against cp1077 (instead of cp4027):" [puppet] - 10https://gerrit.wikimedia.org/r/542153 (owner: 10CDanis) [21:10:30] (03CR) 10Andrew Bogott: [C: 03+1] "cloud host spot-checks also show no-op. https://puppet-compiler.wmflabs.org/compiler1001/18848/" [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [21:10:37] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:19:46] I'm grabbing the conch. [21:23:43] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): package wikimedia-lvs-realserver for buster - https://phabricator.wikimedia.org/T235140 (10Dzahn) Oh, that was quick and easier than i thought. Thank you! 
[21:24:13] (03CR) 10Ema: [C: 03+1] vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 (owner: 10CDanis) [21:30:50] 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew) [21:33:10] 04Critical Alert for device ps1-b1-eqiad.mgmt.eqiad.wmnet - Device rebooted [21:38:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-b1-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [21:38:17] 10Operations: Puppet breakage in automation-feedback VMs - https://phabricator.wikimedia.org/T234452 (10Andrew) [21:39:25] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.1/extensions/VisualEditor/lib/ve/src/dm/ve.dm.TreeCursor.js: T234881 TreeCursor: cross ignored nodes properly from the end of a text node (duration: 00m 54s) [21:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:29] T234881: TreeModifier: remover does not skip over a deleted node immediately following a text node - https://phabricator.wikimedia.org/T234881 [21:40:38] (03PS4) 10CDanis: vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 [21:40:53] (03CR) 10CDanis: [C: 03+2] vcl: support new ACL bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/542153 (owner: 10CDanis) [21:45:53] (03CR) 10Jforrester: [C: 03+2] Wikibase: Don't check to shard wmgWBSharedCacheKey for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542179 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:46:42] (03Merged) 10jenkins-bot: Wikibase: Don't check to shard wmgWBSharedCacheKey for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542179 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:47:35] (03CR) 10Jforrester: [C: 03+2] Don't check to shard static config cache for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542180 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) 
[21:48:13] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Don't check to shard wmgWBSharedCacheKey for HHVM any more (duration: 00m 51s) [21:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:21] (03Merged) 10jenkins-bot: Don't check to shard static config cache for HHVM any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542180 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:49:40] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Don't check to shard static config cache for HHVM any more (duration: 00m 50s) [21:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:32] (03CR) 10Jforrester: [C: 03+2] Drop HHVM special-case for SVG converter, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542181 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:53:19] (03Merged) 10jenkins-bot: Drop HHVM special-case for SVG converter, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542181 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:55:42] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Drop HHVM special-case for SVG converter, no longer used (duration: 00m 51s) [21:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:10] (03CR) 10Jforrester: [C: 03+2] Drop special-case for PHP7 in PHPAutoPrepend, now always used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542182 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [21:56:35] (03PS1) 10Dzahn: phabricator: fix duplicate installation of PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/542240 [21:56:58] (03Merged) 10jenkins-bot: Drop special-case for PHP7 in PHPAutoPrepend, now always used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542182 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) 
[21:58:13] (03Abandoned) 10Dzahn: phabricator: fix duplicate installation of PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/542240 (owner: 10Dzahn) [21:58:57] !log jforrester@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: Drop special-case for PHP7, now always used (duration: 00m 51s) [21:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:14] (03PS2) 10Dzahn: phabricator: fix duplicate installation of PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/541993 (https://phabricator.wikimedia.org/T190568) [21:59:16] (03CR) 10Jforrester: [C: 03+2] "Thanks for the context." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542183 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [22:00:08] integration-config-zuul-layout-diff-docker for 542235 has been running for 18 minutes already; can anyone restart or stop the job, as the patch is already merged? [22:00:14] (03Merged) 10jenkins-bot: Drop nutcracker indirection for HHVM servers, just point to localhost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542183 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [22:04:18] (03PS2) 10Jforrester: Drop HHVM XHProf and Arclamp code, no longer called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) [22:04:43] !log jforrester@deploy1001 Synchronized wmf-config/mc.php: Drop nutcracker indirection for HHVM servers, just point to localhost (duration: 00m 51s) [22:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:29] (03CR) 10Jforrester: "Happy for this to be deployed whenever. Will leave for you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/542185 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [22:05:51] (03PS2) 10Jforrester: Remove debug line from wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542195 (owner: 10Reedy) [22:07:18] (03CR) 10Jforrester: [C: 03+2] Remove debug line from wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542195 (owner: 10Reedy) [22:07:34] (03PS2) 10Jforrester: [Beta Cluster] Enable wmgUseCSPReportOnly for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) [22:07:39] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Enable wmgUseCSPReportOnly for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester) [22:08:06] (03Merged) 10jenkins-bot: Remove debug line from wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542195 (owner: 10Reedy) [22:08:42] (03Merged) 10jenkins-bot: [Beta Cluster] Enable wmgUseCSPReportOnly for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: 10Jforrester) [22:08:56] (03PS6) 10Dzahn: Phabricator: Remove support for mod_php and default to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [22:09:10] (03CR) 10Dzahn: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [22:10:38] !log jforrester@deploy1001 Synchronized wmf-config/wikitech.php: Remove debug line dating from 2015-12-08! (duration: 00m 51s) [22:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:53] (03CR) 10Dzahn: [C: 03+2] "looking good in compiler. thanks! 
https://puppet-compiler.wmflabs.org/compiler1001/18851/" [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [22:11:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18851/" [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [22:11:38] (03PS2) 10Jforrester: Add 'periodical' as run mode to $wgDisableQueryPageUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [22:14:43] (03CR) 10Jforrester: [C: 03+2] Add 'periodical' as run mode to $wgDisableQueryPageUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [22:15:46] (03Merged) 10jenkins-bot: Add 'periodical' as run mode to $wgDisableQueryPageUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [22:17:34] 10Operations: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [22:17:41] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T78711 Update cron-updated miser pages to say they are run periodically, not never (duration: 00m 51s) [22:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:45] 10Operations: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) p:05Triage→03High [22:17:45] T78711: querypage-no-updates still shown on special pages on wmf wikis that update from cron - https://phabricator.wikimedia.org/T78711 [22:17:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [22:19:18] (03Abandoned) 10Dzahn: phabricator: fix duplicate installation of PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/541993 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:20:24] (03CR) 
10Jforrester: "It works. Success. Let's dust off the messages patch and get that landed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [22:21:05] (03Abandoned) 10Jforrester: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539850 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [22:21:20] (03PS3) 10Jforrester: beta: noop: remove unused Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663) (owner: 10Pmiazga) [22:21:43] (03CR) 10Jforrester: [C: 03+2] beta: noop: remove unused Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663) (owner: 10Pmiazga) [22:22:05] (03CR) 10Dzahn: "thanks, this fixed it on phab1003 and phab2001 and saved me some time trying to fix it while keeping the mod_php support. but you are righ" [puppet] - 10https://gerrit.wikimedia.org/r/542193 (owner: 10Paladox) [22:22:17] (03PS2) 10Jforrester: Remove old and unused MFMobileFormatterHeadings config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540123 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [22:22:40] beta-mediawiki-config-update-eqiad should be checked... https://integration.wikimedia.org/zuul [22:22:43] Jforrester, Reedy... [22:22:51] (03Merged) 10jenkins-bot: beta: noop: remove unused Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663) (owner: 10Pmiazga) [22:23:00] Oops... James_F: [22:23:19] What about beta-mediawiki-config-update-eqiad? 
[22:23:43] There is a lot in the queue for postmerge, ~30 minutes [22:23:51] (03CR) 10Jforrester: [C: 03+2] Remove old and unused MFMobileFormatterHeadings config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540123 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [22:24:10] What's that got to do with beta-mediawiki-config-update-eqiad? [22:24:15] And no, that queue amount isn't problematic [22:24:39] (03Merged) 10jenkins-bot: Remove old and unused MFMobileFormatterHeadings config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540123 (https://phabricator.wikimedia.org/T232690) (owner: 10Pmiazga) [22:24:57] Mostly the issue is the codehealth post-merge job, which we should re-evaluate. [22:25:16] I'll talk to Kosta at TechConf about it. [22:25:29] Last call for any random config patches! [22:26:09] Okay, I won't argue, but half an hour is too long for patches related to operations/mediawiki-config to sit in the postmerge queue, because it usually finishes instantaneously... [22:26:34] They run quickly, but they're limited to sequential running. [22:26:39] I could tweak that, but eh. [22:26:45] It's just Beta Cluster. [22:26:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgMFMobileFormatterHeadings, unread T232690 (duration: 00m 51s) [22:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:03] T232690: Skip Expensive MobileFormatter Transformations On Pages With Extremely High Image/Heading counts - https://phabricator.wikimedia.org/T232690 [22:32:45] OK, I'm closing open-season. As you were. 
[22:33:32] (03PS2) 10Dzahn: phabricator: install s-nail instead of heirloom-mailx on any distro [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) [22:34:17] (03CR) 10Dzahn: "used by community_metrics" [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:34:32] (03CR) 10Dzahn: [C: 03+2] phabricator: install s-nail instead of heirloom-mailx on any distro [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:39:04] (03CR) 10Dzahn: "noop on buster and stretch. s-nail already installed." [puppet] - 10https://gerrit.wikimedia.org/r/542191 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [22:41:20] (03PS4) 10Dzahn: Gerrit: Disable auto reloading replication config [puppet] - 10https://gerrit.wikimedia.org/r/541115 (owner: 10Paladox) [22:43:42] (03CR) 10Dzahn: Gerrit: Disable auto reloading replication config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541115 (owner: 10Paladox) [22:43:48] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [22:45:02] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [22:45:06] 10Operations, 10serviceops, 10Patch-For-Review: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Krinkle) [22:45:22] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [22:49:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is 
CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:51:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:53:14] (03CR) 10Dzahn: [C: 03+2] Gerrit: Disable auto reloading replication config [puppet] - 10https://gerrit.wikimedia.org/r/541115 (owner: 10Paladox) [22:56:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:57:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:58:39] (03CR) 10Krinkle: [C: 03+1] scap: mediawiki logstash_checker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191010T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:02:18] (03PS5) 10Thcipriani: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) [23:03:20] (03CR) 10Thcipriani: scap: mediawiki logstash_checker (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [23:10:28] (03PS1) 10Filippo Giunchedi: prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 [23:10:39] mutante: I think ^ is the culprit [23:10:51] for the over reporting of % failed agents [23:11:06] (03PS1) 10Papaul: DNS: Remoce mgmt DNS for db2058 and db2069 [dns] - 10https://gerrit.wikimedia.org/r/542252 [23:11:07] ah:) nice [23:12:24] (03CR) 10jerkins-bot: [V: 04-1] prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [23:13:42] (03CR) 10Dzahn: [C: 03+1] "not super familiar with this but makes sense to me to turn that into a boolean and yes, we do want to fix the over reporting on this." [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [23:14:02] (03PS2) 10Filippo Giunchedi: prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 [23:14:15] thanks ! 
[23:15:55] (03CR) 10jerkins-bot: [V: 04-1] prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [23:16:32] (03CR) 10Dzahn: [C: 03+1] "actual change looks fine, just typo in the commit message (Remoce)" [dns] - 10https://gerrit.wikimedia.org/r/542252 (owner: 10Papaul) [23:16:46] (03PS3) 10Filippo Giunchedi: prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 [23:17:37] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Lake Huron missing due to apparent OSM vandalism - https://phabricator.wikimedia.org/T231691 (10RobinLeicester) I am currently seeing Lake Huron entirely white on most zooms, and with some white tiles on zooms above 9. (on Wikimedia maps: https://m... [23:19:19] (03PS4) 10Dzahn: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [23:19:31] mutante that's not to be merged yet :) [23:19:50] oh, you only updated the commit msg. [23:20:08] (03PS12) 10Filippo Giunchedi: swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [23:20:10] (03PS13) 10Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) [23:20:11] yes, and topic branch. to point that out [23:20:42] thanks! [23:20:49] (03CR) 10Dzahn: "@Thcipriani This should be merged when we upgrade to 2.15.17. using a new topic for that." 
[puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [23:21:08] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for db2058 and db2069 [dns] - 10https://gerrit.wikimedia.org/r/542252 (owner: 10Papaul) [23:21:51] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for db2058 and db2069 [dns] - 10https://gerrit.wikimedia.org/r/542252 (owner: 10Papaul) [23:27:40] (03PS13) 10Filippo Giunchedi: swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [23:27:42] (03PS14) 10Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) [23:29:07] (03CR) 10Filippo Giunchedi: swift: add swiftrepl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [23:34:44] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Reedy) [23:45:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1