[00:02:01] PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-10-03 23:32:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:33:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [00:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) ` papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-3/0/16 { - description "restbase1010 1G"; - } - ge-3/0/17 { - description "restb... [00:55:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) [00:58:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) [02:10:36] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 948.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:38:24] PROBLEM - snapshot of s5 in eqiad on db1115 is CRITICAL: snapshot for s5 at eqiad taken more than 4 days ago: Most recent backup 2019-10-04 02:11:25 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:55:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [02:57:56] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott Im investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [03:03:59] !log restarted nova-conductor on 
cloudcontrol1003 and cloudcontrol1004 — experimental band-aid for T234876 [03:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:03] T234876: nova-conductor running out of mysql connections - https://phabricator.wikimedia.org/T234876 [03:06:30] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [03:17:29] (03PS1) 10Andrew Bogott: nova: try to reduce the number of db connections [puppet] - 10https://gerrit.wikimedia.org/r/541407 (https://phabricator.wikimedia.org/T234876) [03:21:38] (03CR) 10Andrew Bogott: [C: 03+2] "I don't know if this is a good idea, but at least the patch does what I intended." [puppet] - 10https://gerrit.wikimedia.org/r/541407 (https://phabricator.wikimedia.org/T234876) (owner: 10Andrew Bogott) [03:21:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) [03:22:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) ` papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-4/0/30 { - description restbase1007; - } [03:23:24] (03CR) 10Mathew.onipe: query_service: prepare query_service for reusbility (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:25:11] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Papaul) ` papaul@asw2-a-eqiad# show | compare [edit interfaces] - ge-4/0/17 { - description rhenium; - } [03:25:29] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10Papaul) [03:28:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: 
decommission lithium - https://phabricator.wikimedia.org/T229557 (10Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-7/0/34 { - description lithium; - } [03:29:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission lithium - https://phabricator.wikimedia.org/T229557 (10Papaul) [03:31:12] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-7/0/4 { - description dbproxy1009; - } [03:31:50] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) [03:34:06] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:34:48] (03PS13) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [03:34:50] (03PS17) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [03:34:52] (03PS14) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [03:34:54] (03PS9) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [03:34:56] (03PS9) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [03:34:58] (03PS9) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] 
- 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [03:36:00] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:38:00] (03CR) 10jerkins-bot: [V: 04-1] query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:40:06] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Papaul) No switch port reference for kafka1014 and kafka1022 on asw2-c-eqiad or asw-c-eqaid [03:41:24] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Papaul) [03:46:17] (03PS18) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [03:46:19] (03PS15) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [03:46:21] (03PS10) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [03:46:23] (03PS10) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [03:46:25] (03PS10) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 
(https://phabricator.wikimedia.org/T232297) [03:51:35] (03CR) 10Mathew.onipe: "> Patch Set 12: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:59:04] (03PS19) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [03:59:06] (03PS16) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [03:59:08] (03PS11) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [03:59:10] (03PS11) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [03:59:12] (03PS11) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [04:23:54] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2019-10-04 04:11:37 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [05:00:33] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10elukey) 05Open→03Resolved >>! In T232069#5553714, @wiki_willy wrote: > Thanks @elukey . Should we ignore/reso... 
[05:03:21] (03PS3) 10Marostegui: wikireplica_analytics: Change query killer from 4h to 1h [puppet] - 10https://gerrit.wikimedia.org/r/541257 (https://phabricator.wikimedia.org/T233986) [05:07:44] !log Deploy schema change on db1097:3315 - T233625 [05:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:54] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [05:08:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315 T233625', diff saved to https://phabricator.wikimedia.org/P9252 and previous config saved to /var/cache/conftool/dbconfig/20191008-050833-marostegui.json [05:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:11] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10Vgutierrez) we can depool it just before shutting it down, just let us know when you want to do it [05:09:31] (03CR) 10Marostegui: [C: 03+2] "As per this comment T233986#5553960 from bd808 I am going to merge this" [puppet] - 10https://gerrit.wikimedia.org/r/541257 (https://phabricator.wikimedia.org/T233986) (owner: 10Marostegui) [05:10:35] !log Reload query killer on labsdb1011 [05:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 for schema change', diff saved to https://phabricator.wikimedia.org/P9253 and previous config saved to /var/cache/conftool/dbconfig/20191008-051435-marostegui.json [05:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:45] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/541411 (https://phabricator.wikimedia.org/T233986) [05:24:33] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/541411 
(https://phabricator.wikimedia.org/T233986) (owner: 10Marostegui) [05:25:33] !log Depool labsdb1011 for mysql upgrade [05:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:26] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [05:30:38] ^ expected [05:31:48] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [05:31:54] ^ expected too [05:33:38] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [05:35:20] !log drop CitationUsage tables from the log database on db1107/db1108 (the ones listed in the task) - T233893 [05:35:25] marostegui: --^ [05:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:25] o/ [05:35:26] T233893: drop CitatitionUsage data on mysql - https://phabricator.wikimedia.org/T233893 [05:35:30] elukey: <3!!!! [05:35:47] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/541412 [05:40:32] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:26] PROBLEM - eventlogging_sync processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [05:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 db1081 db1080 db1079 db1075 db1074 for PDU maintenance T227138', diff saved to https://phabricator.wikimedia.org/P9254 and previous config saved to /var/cache/conftool/dbconfig/20191008-054127-marostegui.json [05:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:32] T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 [05:41:44] elukey: I guess db1108 is also you? ^ [05:42:45] marostegui: checking [05:43:35] yep my bad sorry [05:43:46] RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:53] no worries! 
[05:43:57] thanks for fixing it [05:44:02] !log drop PageCreation_7481635 table from the log db on db1107/db1108 - T233892 [05:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:06] T233892: Drop page create event data on mysql - https://phabricator.wikimedia.org/T233892 [05:44:40] RECOVERY - eventlogging_sync processes on db1108 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [05:48:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:23] (03PS1) 10Marostegui: site.pp: Remove references to db2058 [puppet] - 10https://gerrit.wikimedia.org/r/541413 (https://phabricator.wikimedia.org/T229543) [05:48:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:36] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2058.codfw.wmnet` - db2058.codfw.wmnet (**PASS**)... 
[05:49:49] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove references to db2058 [puppet] - 10https://gerrit.wikimedia.org/r/541413 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [05:49:59] (03PS1) 10Marostegui: wmnet: Remove db2058 production DNS [dns] - 10https://gerrit.wikimedia.org/r/541414 (https://phabricator.wikimedia.org/T229543) [05:50:58] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove db2058 production DNS [dns] - 10https://gerrit.wikimedia.org/r/541414 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [05:51:06] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [05:51:09] 10Operations, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Patch-For-Review, 10User-Ladsgroup, and 2 others: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ajit_Kumar_Tiwari) @Dcljr: no pages will be overwritten when importing is done because all are new. We are already keeping th... 
[05:51:41] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Marostegui) a:05RobH→03Papaul [05:52:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Marostegui) Host ready final decommissioning steps + switch disablement [06:07:47] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/541412 (owner: 10Marostegui) [06:07:59] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/541412 [06:09:01] !log Repool labsdb1011 for mysql upgrade [06:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:53] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10elukey) 05Resolved→03Open ` elukey@re0.cr2-esams> show bgp summary | match 12871 80.249.208.32 12871 0 0 0 1 1w3d20h Active 2001:7f8:1::a501:2871:2 12871... [06:18:44] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Joe) >>! In T223393#5553766, @Jdforrester-WMF wrote: > If this isn't done before tomorrow, the train rollout will break wikitechwiki... 
[06:22:36] (03PS2) 10Elukey: reportupdater:manifests:job.pp: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [06:22:54] (03PS3) 10Elukey: reportupdater: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [06:23:01] (03PS4) 10Elukey: reportupdater: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [06:23:12] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:47] (03CR) 10Elukey: "This seems safe to merge: https://puppet-compiler.wmflabs.org/compiler1002/18774/" [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [06:48:45] !log Stop MySQL on es011 db1082 db1081 db1080 db1079 db1075 db1074 (replication lag will appear on labs for s5) for on-site maintenance T227138 [06:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:49] (03PS1) 10DCausse: [cirrus] drop support for HHVM connection pooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541425 [06:48:49] T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 [07:00:56] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10Marostegui) @Cmjohnson the following hosts are good to go: db1082 db1081 db1080 db1079 db1075 db1074 es1011 Please note: - db1074 has been powered off as it has a broken PSU, so p... [07:04:04] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:08] (03PS1) 10Alexandros Kosiaris: admin: Add view clusterrolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541426 [07:10:33] !log draining ganeti1002 for upcoming reboot (combined kernel/qemu security updates) [07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, but I am curious, how is the dev tag updated?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [07:13:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update wikifeeds chart to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [07:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P9255 and previous config saved to /var/cache/conftool/dbconfig/20191008-071551-marostegui.json [07:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [07:19:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P9256 and previous config saved to /var/cache/conftool/dbconfig/20191008-071859-marostegui.json [07:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:32] (03PS2) 10Alexandros Kosiaris: admin: Add view clusterrolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541426 [07:21:11] (03PS3) 10Alexandros Kosiaris: admin: Add view clusterrolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541426 (https://phabricator.wikimedia.org/T207200) [07:23:20] (03PS6) 10Elukey: 
profile::kerberos::kdc: add support for bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) [07:25:29] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kdc: add support for bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [07:29:04] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:29:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:30:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:46] (03CR) 10Mobrovac: [C: 03+1] Update wikifeeds chart to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [07:38:58] 10Operations, 10Traffic: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) [07:39:18] 10Operations, 10Traffic: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) p:05Triage→03Normal [07:39:26] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:39:44] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:41:02] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:41:20] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:44:28] (03CR) 10Effie Mouzeli: [C: 03+2] Remove tmpreaper from mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) (owner: 10Muehlenhoff) [07:44:37] (03PS3) 10Effie Mouzeli: Remove tmpreaper from mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) (owner: 10Muehlenhoff) [07:49:25] !log update OTRS to 5.0.38 [07:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:14] !log draining ganeti1003 for upcoming reboot (combined kernel/qemu security updates) [07:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:10] (03CR) 10Gehel: [C: 04-1] "Much cleaner! Thanks! A few more comments inline." 
(0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [07:58:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add view clusterrolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541426 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [07:58:48] (03Merged) 10jenkins-bot: admin: Add view clusterrolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541426 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [08:03:37] ACKNOWLEDGEMENT - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-10-03 23:32:11 Jcrespo rerunning backups/prepare - The acknowledgement expires at: 2019-10-09 08:03:07. https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:03:37] ACKNOWLEDGEMENT - snapshot of s5 in eqiad on db1115 is CRITICAL: snapshot for s5 at eqiad taken more than 4 days ago: Most recent backup 2019-10-04 02:11:25 Jcrespo rerunning backups/prepare - The acknowledgement expires at: 2019-10-09 08:03:07. https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:03:37] ACKNOWLEDGEMENT - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2019-10-04 04:11:37 Jcrespo rerunning backups/prepare - The acknowledgement expires at: 2019-10-09 08:03:07. https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:05:09] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::packages: remove packages for math rendering [puppet] - 10https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) (owner: 10Giuseppe Lavagetto) [08:07:55] (03PS1) 10Alexandros Kosiaris: admin: Fix typo with Group definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/541510 [08:08:07] (03PS2) 10Giuseppe Lavagetto: mediawiki::packages: remove packages for math rendering [puppet] - 10https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) [08:08:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Fix typo with Group definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/541510 (owner: 10Alexandros Kosiaris) [08:08:32] (03Merged) 10jenkins-bot: admin: Fix typo with Group definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/541510 (owner: 10Alexandros Kosiaris) [08:09:48] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [08:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:05] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [08:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:25] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[08:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:46] (03PS1) 10Jon Harald Søby: Enable more transwiki import sources for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541511 (https://phabricator.wikimedia.org/T234892) [08:31:05] (03PS1) 10Elukey: role::druid::analytics::worker: increase query timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/541512 (https://phabricator.wikimedia.org/T234684) [08:31:38] (03CR) 10Elukey: [C: 03+2] role::druid::analytics::worker: increase query timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/541512 (https://phabricator.wikimedia.org/T234684) (owner: 10Elukey) [08:33:08] !log roll restart druid historicals and brokers on druid100[1-3] to pick up new settings - T234684 [08:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:12] T234684: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 [08:38:05] !log mobrovac@deploy1001 Started deploy [restbase/deploy@83fcc0c]: Minor updates to VE logging [08:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:21] 10Operations, 10Traffic: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) It looks to me like this is some kind of timeout issue with POST requests, checking the output of `varnishlog -n frontend -q "FetchEr... 
[08:45:14] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:45:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [08:45:52] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:45:54] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [08:46:02] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:46:09] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@83fcc0c]: Minor updates to VE logging (duration: 08m 05s) [08:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:48] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:47:20] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps 
[08:47:24] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:47:26] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [08:47:36] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:42] (03CR) 10Gehel: [C: 04-1] "PCC in error: https://puppet-compiler.wmflabs.org/compiler1001/18775/wdqs1004.eqiad.wmnet/change.wdqs1004.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [08:57:15] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:57:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:31] 10Operations, 10Traffic: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) on the ATS side it doesn't look like there is any timeout set to 60 seconds though: `vgutierrez@cp4027:~$ sudo -i traffic_ctl --run-r... [09:05:30] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10MoritzMuehlenhoff) I see in dmesg that it got removed from auth1001, but I don't see it in the logs for auth1002, is the USB slot in question maybe inactive? Could you try moving it to a different slot? 
[09:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3315 T233625', diff saved to https://phabricator.wikimedia.org/P9257 and previous config saved to /var/cache/conftool/dbconfig/20191008-090616-marostegui.json [09:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:23] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [09:09:39] !log draining ganeti1004 for upcoming reboot (combined kernel/qemu security updates) [09:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:20] !log Compress logging table on db2088:3312 for idwiki,plwiki,ptwiki,zhwiki [09:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:05] (03PS2) 10Alexandros Kosiaris: ORES: Make redis AOF configurable [puppet] - 10https://gerrit.wikimedia.org/r/540912 (https://phabricator.wikimedia.org/T233831) [09:26:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1088 after schema change', diff saved to https://phabricator.wikimedia.org/P9258 and previous config saved to /var/cache/conftool/dbconfig/20191008-092627-marostegui.json [09:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:22] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-10-08 07:55:13 from db1116.eqiad.wmnet:3317 (866 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:31:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] ORES: Make redis AOF configurable [puppet] - 10https://gerrit.wikimedia.org/r/540912 (https://phabricator.wikimedia.org/T233831) (owner: 10Alexandros Kosiaris) [09:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P9259 and previous config saved to /var/cache/conftool/dbconfig/20191008-093309-marostegui.json 
[09:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:59] (03PS1) 10Jcrespo: bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) [09:36:57] (03CR) 10jerkins-bot: [V: 04-1] bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:40:16] (03PS2) 10Jcrespo: bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) [09:44:42] (03PS3) 10Jcrespo: bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) [09:44:53] (03PS4) 10Jcrespo: bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) [09:45:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:46:40] RECOVERY - snapshot of s5 in eqiad on db1115 is OK: snapshot for s5 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-10-08 08:59:18 from db1102.eqiad.wmnet:3315 (666 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:56:41] (03CR) 10Jcrespo: [C: 03+2] bacula: Change pool/storage names for new bacula director [puppet] - 10https://gerrit.wikimedia.org/r/541517 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:05:47] (03CR) 10Muehlenhoff: "Some comments inline (on PS1, PS3 and this PS4) :-)" (035 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:08:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:08:47] !log jmm@cumin2001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:08:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@00eda0b]: Parsoid VE logging: log if the etags differ [10:08:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:07] (03PS1) 10Alexandros Kosiaris: RBAC: Add an api-metrics ClusterRole and binding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541520 [10:13:38] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [10:13:38] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:13:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] RBAC: Add an api-metrics ClusterRole and binding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541520 (owner: 10Alexandros Kosiaris) [10:14:00] (03Merged) 10jenkins-bot: RBAC: Add an api-metrics ClusterRole and binding [deployment-charts] - 10https://gerrit.wikimedia.org/r/541520 (owner: 
10Alexandros Kosiaris) [10:14:32] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [10:15:08] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:15:08] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:15:30] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@00eda0b]: Parsoid VE logging: log if the etags differ (duration: 06m 32s) [10:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:38] (03PS1) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 [10:16:04] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:16:36] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:16:38] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:16:46] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:16:48] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:16:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:16:53] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:19:10] !log draining ganeti1005 for upcoming reboot (combined kernel/qemu security updates) [10:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:26] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:21:15] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:04] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:22:05] argon complaining about etcd requests latencies is probably because of the ganeti moves [10:22:16] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[10:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:17] (03PS5) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [10:26:09] (03CR) 10Jbond: refactor: Refactor script and use the PyYAML (035 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:30:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:41:18] (03CR) 10Jbond: "about to merge however i just a note for history, i noticed this package already has a pyyaml dependency due to debdeploy_updatespec.py" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:41:27] (03CR) 10Jbond: [C: 03+2] refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:47:23] (03PS6) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [10:47:35] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) [10:47:44] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) p:05Triage→03High [10:48:27] (03CR) 10Jbond: [C: 03+2] refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [10:49:11] (03PS1) 10Jcrespo: bacula: Force install bacula-director, not a dependency on buster [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) [10:49:47] (03PS2) 10Jcrespo: bacula: Force install bacula-director, not a dependency on buster [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) [10:50:17] (03CR) 10Jcrespo: "¯\_(ツ)_/¯" [puppet] - 
10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:54:13] (03PS1) 10Vgutierrez: ATS: Match HTTP transaction activity timeout and TTFB timeouts [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [10:56:46] Starting PDU swap eqiad A2 in 5 minutes https://phabricator.wikimedia.org/T227138 [10:57:11] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:57:11] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [10:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:19] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] !log testing ipmi reset cookbook. using the current pass for both old and new so no reset actually occurs [10:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:00] !log jbond@cumin1001 Updating IPMI password on 1253 hosts - jbond@cumin1001 [10:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:30] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.ipmi-password-reset (exit_code=97) [10:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:41] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:47] !log jbond@cumin1001 Updating IPMI password on 1253 hosts - jbond@cumin1001 [10:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches).
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1100). [11:00:04] Jhs: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] (03CR) 10Vgutierrez: [C: 03+1] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18781/" [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:00:14] I'm here! [11:00:25] I can SWAT today! [11:01:29] Jhs: was T234892 discussed on-wiki? [11:01:29] T234892: Enable more import sources for hiwikisource - https://phabricator.wikimedia.org/T234892 [11:01:44] Urbanecm, no [11:02:08] it just doesn't make sense that a new wikisource should not be able to import from oldwikisource, where it used to be located [11:03:35] Urbanecm, come to think of it, maybe a better change would be to add oldwikisource to the generic one for Wikisource, listed at the top of $wgImportSources? [11:04:23] o/ [11:05:15] Jhs: well, makes sense, +2'ing. 
[11:05:17] (03CR) 10Urbanecm: [C: 03+2] Enable more transwiki import sources for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541511 (https://phabricator.wikimedia.org/T234892) (owner: 10Jon Harald Søby) [11:05:45] thanks Urbanecm :) [11:06:07] (03Merged) 10jenkins-bot: Enable more transwiki import sources for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541511 (https://phabricator.wikimedia.org/T234892) (owner: 10Jon Harald Søby) [11:08:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: fb49404: Enable more transwiki import sources for hiwikisource (T234892) (duration: 00m 55s) [11:08:02] (03PS1) 10Vgutierrez: ATS: Honour 180 secs timeout on backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541525 (https://phabricator.wikimedia.org/T234887) [11:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:06] T234892: Enable more import sources for hiwikisource - https://phabricator.wikimedia.org/T234892 [11:08:13] Jhs: done [11:08:18] Lucas_WMDE: you want to do your stuff? 
[11:09:19] thank you Urbanecm [11:09:23] (03PS2) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 [11:09:27] yw Jhs [11:10:02] I have nothing to do [11:10:12] o/ is just my signal that I’m here and available :) [11:10:12] (03CR) 10jerkins-bot: [V: 04-1] ATS: Honour 180 secs timeout on backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541525 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:10:17] (I was a bit late this time, sorry) [11:10:28] Urbanecm: feel free to close the SWAT [11:10:57] !log EU SWAT done [11:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:00] thanks Lucas_WMDE [11:12:31] (03PS2) 10Vgutierrez: ATS: Honour 180 secs timeout on backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541525 (https://phabricator.wikimedia.org/T234887) [11:12:41] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) I finally got the director running, but sadly it won't start with no devices or clients provisioned, so I created a duplicate of the ones pup...
[11:14:34] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-10-08 09:38:56 from db2099.codfw.wmnet:3314 (1087 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:19:15] 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [11:22:24] (03PS3) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 [11:23:50] (03PS1) 10Arturo Borrero Gonzalez: CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - 10https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) [11:26:24] (03CR) 10Vgutierrez: [C: 03+1] "pcc seems happy https://puppet-compiler.wmflabs.org/compiler1001/18786/" [puppet] - 10https://gerrit.wikimedia.org/r/541525 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:29:08] (03PS4) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 [11:33:46] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jijiki) 05Open→03Resolved a:03jijiki I think we can mark this as resolved, tmpreaper will be going away as we are reimaging mediawiki servers [11:33:50] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10jijiki) [11:37:58] 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [11:38:26] (03Abandoned) 10Vgutierrez: ATS: Honour 180 secs timeout on backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541525 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:39:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) 
[11:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:47] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) [11:54:19] (03PS2) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [11:54:50] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10MoritzMuehlenhoff) 05Resolved→03Open See earlier discussion on task, this is still used by Toolforge, so WMCS SREs might still want to tweak the log spam. [11:54:53] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10MoritzMuehlenhoff) [11:56:19] (03CR) 10jerkins-bot: [V: 04-1] ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:56:40] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) @Milimetric the visual treatment depends on a few factors, although yes, I think we'll want a p... 
[11:57:33] (03PS3) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [11:58:39] (03PS4) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [11:59:33] (03PS1) 10Urbanecm: Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) [11:59:47] (03PS1) 10Muehlenhoff: Add a minimal setup.py and switch to dh-python [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541528 [12:00:10] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [12:00:14] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [12:00:30] ^ I guess that's from the PDU maintenance? 
[12:01:08] marostegui it was plugged in but it looks like it has a bad PSU [12:01:14] PROBLEM - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:01:39] elukey: ^ [12:01:45] (03CR) 10Vgutierrez: [C: 03+1] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18789/" [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [12:01:53] (03PS2) 10Muehlenhoff: Add a minimal setup.py and switch to dh-python [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541528 [12:02:09] (03PS2) 10Urbanecm: Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) [12:02:27] marostegui: thanks :) [12:02:51] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [12:03:49] (03PS3) 10Urbanecm: Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) [12:03:54] RECOVERY - Host an-worker1079 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:07:01] 10Operations, 10Wikimedia-Mailing-lists, 10Wikispore: Creation of Wikispore mailing list - https://phabricator.wikimedia.org/T232961 (10Pharos) Can we create it this week? This will be a vital tool in building up community discussion and participation, and we want to do a kind of public launch for the projec... 
[12:10:46] PROBLEM - IPMI Sensor Status on kafka-jumbo1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:11:32] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10Cmjohnson) @MoritzMuehlenhoff It should be working now [12:12:20] PROBLEM - Host ms-be1044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:12:24] PROBLEM - Host ms-be1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:13:30] PROBLEM - Host cloudelastic1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:13:34] PROBLEM - Host db1075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:12] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10MoritzMuehlenhoff) 05Open→03Resolved Confirmed, thanks. [12:14:26] PROBLEM - Host ms-be1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:34] PROBLEM - Host db1074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:15:06] PROBLEM - Host es1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:15:06] PROBLEM - Host kafka-jumbo1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:15:36] PROBLEM - Host an-worker1078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:15:36] PROBLEM - Host tungsten.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:15:36] PROBLEM - Host an-presto1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:16:14] PROBLEM - IPMI Sensor Status on es1012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:16:26] PROBLEM - Host db1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:18:02] RECOVERY - Host ms-be1044.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 1.16 ms [12:18:06] RECOVERY - Host ms-be1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [12:18:32] 10Operations, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10MoritzMuehlenhoff) [12:18:36] RECOVERY - Host db1075.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [12:19:10] RECOVERY - Host cloudelastic1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [12:19:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541528 (owner: 10Muehlenhoff) [12:20:08] RECOVERY - Host ms-be1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [12:20:16] RECOVERY - Host db1074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [12:21:06] RECOVERY - Host es1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [12:21:06] RECOVERY - Host an-presto1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [12:21:06] RECOVERY - Host kafka-jumbo1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [12:21:06] RECOVERY - Host an-worker1078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [12:21:06] RECOVERY - Host tungsten.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [12:21:06] RECOVERY - Host db1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [12:23:06] RECOVERY - IPMI Sensor Status on kafka-jumbo1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:23:36] RECOVERY - IPMI Sensor Status on an-worker1079 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:24:01] (03PS1) 10Marostegui: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541530 [12:24:20] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Depool es1012 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/541530 (owner: 10Marostegui) [12:24:28] thanks jynus [12:25:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541530 (owner: 10Marostegui) [12:26:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541530 (owner: 10Marostegui) [12:27:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1012 T227138 (duration: 00m 51s) [12:27:13] !log Stop MySQL on es1012 for onsite maintenance [12:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:16] T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 [12:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:11] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [12:34:24] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541531 [12:35:56] RECOVERY - IPMI Sensor Status on es1012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:36:53] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541531 (owner: 10Marostegui) [12:37:21] (03PS1) 10Matthias Mullie: Increase rate limits for newbie non-ip users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463) [12:37:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541531 (owner: 10Marostegui) [12:37:47] 10Operations, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10MoritzMuehlenhoff) [12:37:49] (03CR) 10Matthias Mullie: [C: 04-2] "TBD" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463) (owner: 10Matthias Mullie) [12:38:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1012 T227138 (duration: 00m 51s) [12:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 [12:39:26] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541533 [12:40:23] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541533 [12:41:28] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541533 (owner: 10Marostegui) [12:42:15] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541533 (owner: 10Marostegui) [12:43:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool es1011 (duration: 00m 51s) [12:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:16] (03CR) 10Muehlenhoff: [C: 03+2] Add a minimal setup.py and switch to dh-python [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541528 (owner: 10Muehlenhoff) [12:44:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1093 after schema change', diff saved to https://phabricator.wikimedia.org/P9261 and previous config saved to /var/cache/conftool/dbconfig/20191008-124417-marostegui.json [12:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:22] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add druid load [puppet] - 10https://gerrit.wikimedia.org/r/541535 [12:48:23] 10Operations, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Patch-For-Review, 10User-Ladsgroup, and 2 others: Create Wikisource 
Hindi - https://phabricator.wikimedia.org/T218155 (10jhsoby) I have imported all Hindi-specific pages now (everything that was in the Hindi category on mulwikisource). What remai... [12:52:00] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: add druid load [puppet] - 10https://gerrit.wikimedia.org/r/541535 (owner: 10Elukey) [12:53:36] (03PS1) 10Muehlenhoff: Bump changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541536 [12:54:11] akosiaris: merged your changes for labs private [12:54:20] elukey: ah, thanks! [12:55:00] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [12:57:35] ah this is downtime expired --^ [12:57:38] (03PS1) 10Marostegui: db-eqiad.php: Fully repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541538 [12:59:50] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are running.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [13:02:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor comment, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541522 (owner: 10Giuseppe Lavagetto) [13:05:57] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541538 (owner: 10Marostegui) [13:06:17] (03CR) 10Jhedden: [C: 03+1] CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - 10https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: 10Arturo Borrero Gonzalez) [13:06:42] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541538 (owner: 10Marostegui) [13:07:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool es1011 (duration: 00m 51s) [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:21] (03CR) 10Elukey: [C: 03+1] "lgtm, but let's triple check with search as well!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539094 (https://phabricator.wikimedia.org/T204735) (owner: 10Mforns) [13:15:07] (03PS5) 10Ottomata: reportupdater: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [13:15:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] reportupdater: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [13:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082 db1081 db1080 db1079 db1075 db1074 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9262 and previous config saved to /var/cache/conftool/dbconfig/20191008-131752-marostegui.json [13:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] (03PS1) 10Elukey: profile::analytics::refinery::job::druid_load: add kerb support [puppet] - 10https://gerrit.wikimedia.org/r/541541 (https://phabricator.wikimedia.org/T226698) [13:21:25] (03PS2) 10Elukey: profile::analytics::refinery::job::druid_load: add kerb support [puppet] - 10https://gerrit.wikimedia.org/r/541541 (https://phabricator.wikimedia.org/T226698) [13:26:19] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18791/" [puppet] - 10https://gerrit.wikimedia.org/r/541541 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [13:29:07] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@8490964]: Update mobileapps to abd3543 [13:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:09] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1082 db1081 db1080 db1079 db1075 db1074 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9263 and previous config saved to /var/cache/conftool/dbconfig/20191008-133208-marostegui.json [13:32:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:12] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@8490964]: Update mobileapps to abd3543 (duration: 06m 04s) [13:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] !log marostegui@cumin2001 dbctl commit (dc=all): 'More traffic for db1082 db1081 db1080 db1079 db1075 db1074 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9264 and previous config saved to /var/cache/conftool/dbconfig/20191008-134152-marostegui.json [13:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:33] (03PS1) 10Jbond: puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - 10https://gerrit.wikimedia.org/r/541545 [13:43:12] (03PS3) 10Mholloway: Update wikifeeds chart to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) [13:44:13] (03PS2) 10Jbond: puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) [13:44:35] (03CR) 10Mholloway: [V: 03+2 C: 03+2] Update wikifeeds chart to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [13:46:57] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [13:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:12] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[13:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:34] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool db1082 db1081 db1080 db1079 db1075 db1074 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9265 and previous config saved to /var/cache/conftool/dbconfig/20191008-135033-marostegui.json [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:59] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1103:3312 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9266 and previous config saved to /var/cache/conftool/dbconfig/20191008-135058-marostegui.json [13:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:03] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [13:53:20] (03PS1) 10Elukey: profile::analytics::refinery::job::test::druid_load: use analytics1041 [puppet] - 10https://gerrit.wikimedia.org/r/541547 [13:53:38] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [13:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:25] 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10herron) 05Open→03Resolved a:03herron Hello, the WMFSF list has been disabled and archives will remain in place. I'll transition to resolved now. Thanks! 
[13:55:18] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18792/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/541547 (owner: 10Elukey) [14:00:15] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:55] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:46] (03CR) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541522 (owner: 10Giuseppe Lavagetto) [14:02:48] (03PS5) 10Giuseppe Lavagetto: lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 [14:05:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs::monitor_service: partial refactoring [puppet] - 10https://gerrit.wikimedia.org/r/541522 (owner: 10Giuseppe Lavagetto) [14:08:35] 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [14:09:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "Looks good!" 
[debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [14:11:48] (03PS1) 10Mholloway: Update charts/index.yaml to add wikifeeds v0.0.4 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/541548 (https://phabricator.wikimedia.org/T170455) [14:15:38] (03CR) 10Muehlenhoff: "This sounds perfectly fine, the chroot protection is rather weak anyway (adding hardening to the systemd unit would gain more protection i" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:29:33] (03PS3) 10Jbond: puppet::rsync: disable chroot on volatile and ssl rsync [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) [14:30:07] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:42:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/541545 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:44:29] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Pcoombe) We're in peak fundraising season now, and I'm worried this might affect links to https://donate.wikimedia.o... 
[14:48:38] (03CR) 10Mobrovac: [C: 03+1] Update charts/index.yaml to add wikifeeds v0.0.4 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/541548 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:50:34] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Switch replication url for replica to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [14:53:50] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:56:15] 10Operations, 10Core Platform Team, 10Editing-team, 10Fundraising-Backlog, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10DStrine) [15:08:06] (03PS1) 10Elukey: profile::analytics::cluster::users: ensure user druid [puppet] - 10https://gerrit.wikimedia.org/r/541554 [15:12:09] (03CR) 10Ayounsi: [C: 03+1] CloudVPS: use wikimediacloud.org domain for Neutron-related IP addresses [dns] - 10https://gerrit.wikimedia.org/r/541526 (https://phabricator.wikimedia.org/T234836) (owner: 10Arturo Borrero Gonzalez) [15:13:49] !log renumber BGP session to AS4761 on cr1-eqsin [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:51] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) [15:15:04] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:18:14] !log add BGP sessions to AS2635 on cr2-eqiad [15:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:58] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - 
https://phabricator.wikimedia.org/T232417 (10Effeietsanders) p:05Normal→03High Now the subscriptions were not just disabled, but some 30+ were actually unsubscribed. We're doing a huge disservice to community members that... [15:20:59] !log add BGP sessions to AS199524 on cr2-eqdfw [15:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:56] 10Operations, 10Core Platform Team, 10Editing-team, 10Fundraising-Backlog, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) @Pcoombe I don't think this will go live before January, but if it helps, let's just exclude any an... [15:30:47] !log remove 2 more sessions to AS12871 on cr2-esams - T232617 [15:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:52] T232617: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 [15:31:20] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10ayounsi) 05Open→03Resolved Seems like they had 4 sessions in total. 
[15:35:19] (03CR) 10Muehlenhoff: [C: 03+2] Bump changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541536 (owner: 10Muehlenhoff) [15:50:40] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10Cmjohnson) [15:51:01] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10Cmjohnson) @fgiunchedi all the on-site work has been completed...they need production DNS [15:52:17] (03PS1) 10Muehlenhoff: debdeploy: Fix update_type type [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/541558 [15:55:37] (03CR) 10Mholloway: [V: 03+2 C: 03+2] Update charts/index.yaml to add wikifeeds v0.0.4 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/541548 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [15:56:28] (03PS2) 10Elukey: profile::analytics::cluster::users: ensure user druid [puppet] - 10https://gerrit.wikimedia.org/r/541554 [15:57:37] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [15:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:58] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [15:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:10] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[16:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:30] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:01:30] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:02:41] (03PS3) 10Elukey: profile::analytics::cluster::users: ensure user druid [puppet] - 10https://gerrit.wikimedia.org/r/541554 [16:04:34] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) ping again on this :) [16:04:43] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10wiki_willy) a:05Jclark-ctr→03RobH @RobH - can you take care of DNS for this to get things completed from the dc-ops side for this install? This one's super urgent, so if you... 
[16:10:02] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18795/" [puppet] - 10https://gerrit.wikimedia.org/r/541554 (owner: 10Elukey) [16:15:25] (03PS3) 10Herron: logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486 [16:20:14] (03CR) 10Herron: [C: 03+2] logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron) [16:29:10] (03CR) 10Daimona Eaytoy: [C: 03+1] [cirrus] drop support for HHVM connection pooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541425 (owner: 10DCausse) [16:35:56] (03PS1) 10RobH: ms-be105[1-6] production dns [dns] - 10https://gerrit.wikimedia.org/r/541567 (https://phabricator.wikimedia.org/T232367) [16:49:29] (03CR) 10RobH: [C: 03+2] ms-be105[1-6] production dns [dns] - 10https://gerrit.wikimedia.org/r/541567 (https://phabricator.wikimedia.org/T232367) (owner: 10RobH) [16:50:50] godog: heyas did you wanna handle the puppet and os install on new ms-be hosts or should i? [16:50:59] production dns is now in place [16:51:16] (either answer is fine i just dunno if you reimage them from role spare or not so asking here) [16:53:23] robh: hey, reimaging / puppet run into their swift::storage role is fine, thanks! [16:53:36] as in, they won't enter service even with the role applied [16:54:02] ahh, ok so install and apply normal role immediately, no spare [16:54:14] want me to install? [16:54:41] robh: yes please, thanks! [16:54:54] ok, i'll do it now and you should have the task assigned to you when you start tomorrow =] [16:55:02] should work on first attempt, let me know if it doesn't [16:55:16] robh: working PDT hours this week (in Vancouver) so I'm around! 
[16:55:33] oh, cool [16:55:38] then you'll have shortly [16:57:39] awesome, thanks for your help [16:58:19] welcome [17:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1700). [17:00:37] no parsoid deploy today [17:02:15] 10Operations, 10ops-eqiad, 10Patch-For-Review: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:11:26] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:15:03] (03PS1) 10RobH: setting most new ms-be hosts mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/541578 (https://phabricator.wikimedia.org/T232367) [17:16:21] (03CR) 10RobH: [C: 03+2] setting most new ms-be hosts mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/541578 (https://phabricator.wikimedia.org/T232367) (owner: 10RobH) [17:16:48] (03PS3) 10Krinkle: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [17:16:54] (03PS4) 10Krinkle: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [17:19:52] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson) I don't know what you need me to do...the servers were setup correctly. 
[17:22:37] (03CR) 10Brennen Bearnes: "> Patch Set 2: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [17:23:56] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:11] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10RobH) [17:30:16] chaomodus: ^ [17:30:28] yejp [17:32:16] !log cutting wmf/1.35.0-wmf.1 [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:08] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:45:45] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10Cmjohnson) [17:46:57] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10Cmjohnson) the pdu swap is over, we did lose an-worker1079 due to the PSUs not failing over. Everything is cabled and they're linked together. still needs updating. 
[17:50:09] (03PS1) 10RobH: fixing mac entries for new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/541579 (https://phabricator.wikimedia.org/T232367) [17:52:08] (03CR) 10RobH: [C: 03+2] fixing mac entries for new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/541579 (https://phabricator.wikimedia.org/T232367) (owner: 10RobH) [17:54:04] (03PS8) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [17:56:24] (03CR) 10Ottomata: "I think if you just include the druid user in profile::hadoop::master::hadoop_user_groups, the hdfs home dir will be auto-create." [puppet] - 10https://gerrit.wikimedia.org/r/541554 (owner: 10Elukey) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1800) [18:12:53] (03CR) 10Brennen Bearnes: [C: 03+2] Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [18:13:00] (03CR) 10jerkins-bot: [V: 04-1] Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [18:15:08] (03PS1) 10Jforrester: [Beta Cluster] Disable wgLegacyJavaScriptGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541581 [18:16:33] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10wiki_willy) a:05Cmjohnson→03RobH Re-assigning to @RobH to complete install/updating of new PDU. 
Thanks, Willy [18:19:09] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix phatality deployment script [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [18:19:16] (03PS2) 10Filippo Giunchedi: Fix phatality deployment script [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [18:20:06] (03PS2) 10Dzahn: parsoid/conftool: add wtp servers as apache appservers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [18:20:46] (03PS9) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [18:21:43] (03PS3) 10Jeena Huneidi: Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) [18:22:32] (03CR) 1020after4: [C: 03+1] site/phabricator: apply phab role on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [18:26:12] (03PS1) 10Dduvall: Group0 to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541582 [18:35:35] (03CR) 10Brennen Bearnes: [C: 03+2] Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [18:43:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:43:12] (03PS2) 10Jforrester: [Beta Cluster] Disable wgLegacyJavaScriptGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541581 (https://phabricator.wikimedia.org/T72470) [18:44:29] (03Merged) 10jenkins-bot: Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) (owner: 10Jeena Huneidi) [18:45:23] 
!log dduvall@deploy1001 Pruned MediaWiki: 1.34.0-wmf.24 (duration: 08m 24s) [18:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] !log codfw-prod: more weight to ms-be205[1-6] - T233638 [18:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:20] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [18:54:21] !log dduvall@deploy1001 Started scap: testwiki to php-1.35.0-wmf.1 and rebuild l10n cache [18:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:03] (03PS10) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [18:57:20] "Currently active MediaWiki versions:" is broken on noc/conf :( [18:58:45] looks to be a cached issue [18:59:01] works locally on deploy1001 [18:59:02]
Currently active MediaWiki versions: 1.34.0-wmf.25, 1.35.0-wmf.1
[18:59:45] 10Operations, 10media-storage, 10User-fgiunchedi: ms-be1020 - host went down - https://phabricator.wikimedia.org/T234698 (10fgiunchedi) Indeed we'd need to upgrade its firmware as per {T141756}, holding off once we have new swift hw in place in eqiad to not "jinx it" if we possibly can [19:00:04] marxarelli: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - American version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1900). [19:00:49] (03CR) 10Cwhite: [C: 03+2] "Fixed deb package build." [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [19:02:22] (03PS1) 10CRusnov: Add netbox geodns entries. [dns] - 10https://gerrit.wikimedia.org/r/541602 [19:12:08] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556235298(77gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [19:13:16] (03CR) 10Dzahn: "let' use topic branch "gerrit-migration-day" on all the patches we want to merge on the day of" [dns] - 10https://gerrit.wikimedia.org/r/541393 (owner: 10Paladox) [19:13:42] !log dduvall@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.1 and rebuild l10n cache (duration: 19m 21s) [19:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:25] (03PS3) 10Dzahn: Gerrit: Switch replication url for replica to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [19:14:53] mutante remember gerrit requires a restart after merging that :) (just making sure that your aware). [19:15:11] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541115/ [19:15:16] should stop us having to restart. [19:15:27] paladox: yes, i am aware. that is great news though [19:15:33] ok :) [19:15:55] hmm. yea. the autoReload thing [19:16:27] setting that to _false_ makes it .. eh.. reload config ?? 
[19:16:33] reads that again [19:17:06] i am not sure we actually prefer that over having control over both merge and restart separately [19:17:54] first let me merge the part that is already reviewed and we all agree [19:18:26] (03PS1) 10Jforrester: CommonSettings-labs: array_merge on NULL returns NULL, not [], what fun [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541603 [19:18:28] (03PS1) 10Jforrester: CommonSettings: Split out the CSP configuration s it can be more easily over-ridden [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541604 [19:18:34] (03CR) 10Dzahn: [C: 03+2] Gerrit: Switch replication url for replica to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [19:19:00] paladox: let's use new topic branch name "gerrit-migration-day" and slap it on the patches for the "day of" [19:19:10] ok, yup! [19:19:12] thanks! [19:29:16] !log adding swagger exporter to apt repo [19:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:25] !log dduvall@deploy1001 Synchronized php-1.35.0-wmf.1/skins/MinervaNeue/: sync T233521 backport prior to group0 (duration: 00m 59s) [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:30] T233521: Language selector and edit button icon not displayed on first load in iOS 13 - https://phabricator.wikimedia.org/T233521 [19:39:02] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556235298(77gb) Mathew.onipe Ill silence this for. Will keep an eye to see if it recovers. If it doesnt, then reindex is imminent. - The acknowledgement expires at: 2019-10-09 19:36:56. 
https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [19:39:06] (03CR) 10EBernhardson: [C: 03+1] [cirrus] drop support for HHVM connection pooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541425 (owner: 10DCausse) [19:40:13] (03PS1) 10Phamhi: tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) [19:40:36] (03CR) 10Dduvall: [C: 03+2] Group0 to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541582 (owner: 10Dduvall) [19:41:39] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541582 (owner: 10Dduvall) [19:43:40] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.1 [19:43:43] (03CR) 10Phamhi: tools-webservice: Disable access.log feature by default (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) (owner: 10Phamhi) [19:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:47] !log 1.35.0-wmf.1 promoted to group0, cc: T233849. no rise in error rates. 
no new relevant errors [19:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:50] T233849: 1.35.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T233849 [19:52:44] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27976 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [19:54:30] (03PS6) 10Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [19:54:32] (03PS1) 10Dzahn: add wtp1025/wtp2001 to list of servers using Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) [19:57:28] jouncebot: now [19:57:28] For the next 0 hour(s) and 2 minute(s): MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T1900) [19:59:32] (03CR) 10SBassett: [C: 03+1] CommonSettings-labs: array_merge on NULL returns NULL, not [], what fun [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541603 (owner: 10Jforrester) [20:00:33] (03CR) 10SBassett: [C: 03+1] CommonSettings: Split out the CSP configuration s it can be more easily over-ridden [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541604 (owner: 10Jforrester) [20:09:23] 10Operations, 10Cloud-Services: login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 (10Dzahn) [20:11:36] 10Operations, 10Traffic: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10crusnov) [20:12:00] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 
[20:12:12] (03PS2) 10CRusnov: Add netbox geodns entries. [dns] - 10https://gerrit.wikimedia.org/r/541602 (https://phabricator.wikimedia.org/T234997) [20:13:06] (03CR) 10Subramanya Sastry: add wtp1025/wtp2001 to list of servers using Parsoid/PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:15:18] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [20:24:30] !log labweb1001 - edit /srv/mediawiki/wmf-config/wikitech.php to and change "false" to "true" on line 52 to enable LDAP debug logging for T234996 [20:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:34] T234996: login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 [20:25:44] (03PS1) 10Cwhite: profile, prometheus: install swagger exporter on icinga [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [20:28:09] 10Operations, 10Cloud-Services: login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 (10Dzahn) Enabled the debug log as suggested by Krenair. Debug log shows a restCall to cloudcontrol1003 to get a token: OpenStackNovaController::restCall fullurl: http://cloudcontrol1003.wikimedia.org fo... 
[20:29:00] Operations, Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (mepps) [20:30:10] Operations, Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (mepps) [20:38:51] !log labweb1001 - disabled 2fa for myself on Wikitech using disableOATHAuthForUser.php --wiki=labswiki to debug T234996 [20:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:55] T234996: login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 [20:43:05] (CR) Eevans: [C: +1] "LGTM, but we ought to have @BPirkle take a look (since he did the same for sessionstore)." [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope) [21:00:01] (CR) Subramanya Sastry: add wtp1025/wtp2001 to list of servers using Parsoid/PHP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:01:06] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:01:22] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:01:32] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:01:42] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:14] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:02:26] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:02:34] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:02:36] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:03:10] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:03:20] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:28] notebook1003 - echo "please don't use all the RAM" | wall [21:03:37] ;) [21:03:40] :-D [21:03:48] maybe we could run nagios-nrpe-server in a slice that can't get oomed [21:03:52] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 23073 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [21:04:10] chaomodus: that would be nice indeed [21:07:11] i guess we can exempt it from oom killer (well reduce its score enough to exempt it) [21:08:00] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port
5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:08:10] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:12] it died again ha [21:08:40] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:08:52] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:09:00] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:09:02] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:09:10] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:09:26] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:09:29] or we exclude this host from icinga alerts and declare it a test server [21:09:36] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [21:09:46] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:46] !log restarted nagios-nrpe-server on 
notebook1003 [21:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:52] yah i suppose, or give it more ram [21:10:59] oom killer is super killy on that box [21:11:19] https://phabricator.wikimedia.org/T212824 [21:11:42] see " Introduce profile::analytics::cluster::limits::statistics" etc [21:12:18] oic [21:12:18] https://phabricator.wikimedia.org/T212824#4967798 [21:18:51] Operations, Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (Varnent) @herron - just want to verify that the old list will forward to the new one in case people use the old SF address. New list is: sf-foundation-local@wikimedia.org [21:23:10] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [21:23:36] Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 (bd808) [21:24:12] (PS2) Subramanya Sastry: Add wtp1025/wtp2001 to the list of servers using Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:25:15] (CR) Subramanya Sastry: "mutante: Uploaded a PS2 with changes I was recommending." [mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:26:12] (CR) Subramanya Sastry: "Added CPT members + Gergo in case they have opinions about how this is configured in production."
[mediawiki-config] - https://gerrit.wikimedia.org/r/541611 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:26:22] Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 (bd808) Possibly a duplicate of {T234686} reported on 2019-10-04 [21:29:15] (PS1) Dzahn: logstash: add wtp1025/wtp2001 to filter-mediawiki with parsoid-php channel [puppet] - https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) [21:30:50] (CR) Subramanya Sastry: [C: +1] logstash: add wtp1025/wtp2001 to filter-mediawiki with parsoid-php channel [puppet] - https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:31:32] (CR) Subramanya Sastry: [C: +1] "Added parsing team members as reviewers as an FYI." [puppet] - https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn) [21:42:28] (PS7) Brennen Bearnes: mediawiki-dev: use wikimedia/mediawiki-core:dev [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T234391) [21:49:43] (CR) Krinkle: [C: +1] [Beta Cluster] Disable wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/541581 (https://phabricator.wikimedia.org/T72470) (owner: Jforrester) [21:53:44] Prod clear? Going to deploy some fixes.
[21:54:02] (CR) Jforrester: [C: +2] [Beta Cluster] Disable wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/541581 (https://phabricator.wikimedia.org/T72470) (owner: Jforrester) [21:54:07] (CR) Jforrester: [C: +2] CommonSettings-labs: array_merge on NULL returns NULL, not [], what fun [mediawiki-config] - https://gerrit.wikimedia.org/r/541603 (owner: Jforrester) [21:54:09] (CR) Jforrester: [C: +2] CommonSettings: Split out the CSP configuration s it can be more easily over-ridden [mediawiki-config] - https://gerrit.wikimedia.org/r/541604 (owner: Jforrester) [21:55:10] (Merged) jenkins-bot: [Beta Cluster] Disable wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/541581 (https://phabricator.wikimedia.org/T72470) (owner: Jforrester) [21:55:37] (Merged) jenkins-bot: CommonSettings-labs: array_merge on NULL returns NULL, not [], what fun [mediawiki-config] - https://gerrit.wikimedia.org/r/541603 (owner: Jforrester) [21:55:46] (Merged) jenkins-bot: CommonSettings: Split out the CSP configuration s it can be more easily over-ridden [mediawiki-config] - https://gerrit.wikimedia.org/r/541604 (owner: Jforrester) [21:58:25] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Split out the CSP configuration s it can be more easily over-ridden (duration: 00m 59s) [21:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:17] (PS9) Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - https://gerrit.wikimedia.org/r/537613 [22:06:28] (CR) jerkins-bot: [V: -1] WIP: turn on swiftrepl on swift frontends [puppet] - https://gerrit.wikimedia.org/r/537613 (owner: Filippo Giunchedi) [22:07:46] (PS1) RobH: ms-be105[15] dhcp info [puppet] - https://gerrit.wikimedia.org/r/541652 (https://phabricator.wikimedia.org/T232367) [22:08:15] (PS10) Filippo Giunchedi: WIP: turn on swiftrepl on
swift frontends [puppet] - https://gerrit.wikimedia.org/r/537613 [22:08:17] (CR) RobH: [C: +2] ms-be105[15] dhcp info [puppet] - https://gerrit.wikimedia.org/r/541652 (https://phabricator.wikimedia.org/T232367) (owner: RobH) [22:09:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 39 probes of 463 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:10:50] (PS11) Filippo Giunchedi: site: turn on swiftrepl on swift frontends [puppet] - https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) [22:11:31] (PS1) EBernhardson: yarn: Add sequential scheduler queue for heavy jobs [puppet] - https://gerrit.wikimedia.org/r/541654 [22:15:20] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 463 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:21:30] (PS1) Jforrester: [Beta Cluster] Enable wmgUseCSPReportOnly for all [mediawiki-config] - https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) [22:21:48] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/18799/" [puppet] - https://gerrit.wikimedia.org/r/537613 (https://phabricator.wikimedia.org/T162123) (owner: Filippo Giunchedi) [22:25:24] (CR) SBassett: [C: +1] [Beta Cluster] Enable wmgUseCSPReportOnly for all [mediawiki-config] - https://gerrit.wikimedia.org/r/541655 (https://phabricator.wikimedia.org/T211539) (owner: Jforrester) [22:26:04] (PS2) Dzahn: site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568) [22:27:32] (PS3) Dzahn: site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712
(https://phabricator.wikimedia.org/T190568) [22:28:18] (CR) Krinkle: "Sounds good to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz) [22:44:27] (CR) Dzahn: [C: +2] site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [22:44:41] (PS4) Dzahn: site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568) [22:51:07] ACKNOWLEDGEMENT - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn check with dcops for status of PDU work [22:54:48] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.02772 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191008T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:15] i can ship it [23:01:52] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/541425 (owner: DCausse) [23:02:39] (Merged) jenkins-bot: [cirrus] drop support for HHVM connection pooling [mediawiki-config] - https://gerrit.wikimedia.org/r/541425 (owner: DCausse) [23:03:07] ACKNOWLEDGEMENT - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.05351 ge 0.01 daniel_zahn thats just 4 hosts and the trigger was 3 - 4 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:05:18] !log ebernhardson@deploy1001 Synchronized wmf-config/: [cirrus] drop support for HHVM connection pooling (duration: 00m 59s) [23:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:39] SWAT complete [23:18:41] (PS1) Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [23:20:35] (PS2) Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [23:22:48] (CR) jerkins-bot: [V: -1] phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:24:46] (CR) Dzahn: "needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/541666 and more follow-ups, a couple things not working yet" [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:26:16] (CR) Paladox: phabricator: support buster with PHP 7.3 packages (3 comments) [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:28:53] !log phab1001 - replacing tin.eqiad.wmnet with deploy1001.eqiad.wmnet in phabricator/deployment-cache/.config:git_server -
wondering if we can ever get rid of tin (T190568) [23:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:01] T190568: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 [23:31:57] (CR) Paladox: phabricator: support buster with PHP 7.3 packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:33:56] (PS3) Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [23:36:51] (CR) Dzahn: "also changes on phab1003 ? https://puppet-compiler.wmflabs.org/compiler1001/18803/" [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:42:12] Operations, Research, SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (leila) @Nuria is your approval needed on this task?
[23:46:00] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): login on wikitech wiki fails - https://phabricator.wikimedia.org/T234996 (bd808) I set a custom message at https://wikitech.wikimedia.org/wiki/MediaWiki:Loginprompt that will show up on the login screen.{F30597255} [23:54:25] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH) [23:56:21] (PS4) Dzahn: phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) [23:56:29] Operations, ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (RobH) [23:57:16] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): Login on wikitech wiki fails after OpenStack upgrade removed v2 identity API - https://phabricator.wikimedia.org/T234996 (bd808) [23:57:19] (CR) Dzahn: [C: -1] "there are more "php72" require lines.. need to repeat a bunch of code or do it nicer some way" [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn) [23:58:16] (CR) jerkins-bot: [V: -1] phabricator: support buster with PHP 7.3 packages [puppet] - https://gerrit.wikimedia.org/r/541666 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)