[00:02:09] RECOVERY - swift-object-updater on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [00:02:09] RECOVERY - dhclient process on ms-be2043 is OK: PROCS OK: 0 processes with command name dhclient [00:02:09] RECOVERY - swift-account-server on ms-be2043 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [00:02:09] RECOVERY - swift-object-server on ms-be2043 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [00:02:09] RECOVERY - swift-container-replicator on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [00:02:47] RECOVERY - swift-account-replicator on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [00:02:57] RECOVERY - swift-object-replicator on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [00:02:57] RECOVERY - swift-container-server on ms-be2043 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [00:03:03] RECOVERY - swift-container-auditor on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [00:03:23] RECOVERY - swift-container-updater on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [00:11:01] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 68.07, 86.00, 102.50 
https://wikitech.wikimedia.org/wiki/Swift [00:21:26] (03PS12) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [00:21:28] (03PS5) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [00:21:30] (03PS5) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [00:27:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) [00:27:53] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 66.49, 66.69, 79.14 https://wikitech.wikimedia.org/wiki/Swift [00:47:43] (03PS13) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [00:47:46] (03PS6) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [00:47:48] (03PS6) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [00:59:03] 10Operations, 10DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Marostegui) I am running a compare on db2091 now to check its data consistency [01:10:58] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2037 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513021 (https://phabricator.wikimedia.org/T221533) [02:13:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) [02:14:21] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) 
@Marostegui @jcrespo All yours; you can take this task anytime. [02:15:52] PROBLEM - swift-account-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [02:15:52] PROBLEM - swift-object-auditor on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [02:16:59] RECOVERY - swift-object-auditor on ms-be2043 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [02:16:59] RECOVERY - swift-account-auditor on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [02:26:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Marostegui) 05Open→03Resolved Thanks! They all look good! ` 4 hosts will be targeted: dbproxy[2001-2004].codfw.wmnet Confirm to continue [y/n]? y ===== NODE GROUP ====... 
[02:27:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Marostegui) [02:29:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Marostegui) [02:44:02] (03CR) 10Marostegui: [C: 03+1] mariadb: Disable checks of database snapshots [puppet] - 10https://gerrit.wikimedia.org/r/512894 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [04:00:05] PROBLEM - Host cr2-eqord IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ffff::5) [04:00:49] PROBLEM - Host cr2-eqord is DOWN: PING CRITICAL - Packet loss = 100% [04:02:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:02:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:56] (03CR) 10Ppchelko: RESTRouter: Add initial Helm chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [05:10:45] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:49] RECOVERY - Host cr2-eqord IPv6 is UP: PING OK - Packet loss = 0%, RTA = 61.75 ms [05:11:01] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:07] RECOVERY - Host cr2-eqord is UP: PING OK - Packet 
loss = 0%, RTA = 68.07 ms [05:14:23] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:14:23] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:22:45] wait what? [05:36:35] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) p:05Triage→03High [05:38:59] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) [05:52:25] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) So for the two that went down there was no planned maintenance, but we did get an email from the vendor ("00985243 Disturbance") suggesting that this was an unplanne... [06:00:23] 10Operations, 10LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (10WMDE-leszek) >>! In T224507#5218859, @Aklapper wrote: > @WMDE-leszek: Does that mean that the accounts https://phabricator.wikimedia.org/p/Greta_Doci_WMDE/ and https://meta.wikimedia... [06:10:51] paravoid: saw your task/email anything I can do? [06:11:02] XioNoX: nah, all good, go to bed :) [06:11:09] ok :) [06:11:20] thx for taking care of it [06:11:55] yw :) [06:12:27] paravoid: can you have a look at https://phabricator.wikimedia.org/T224511 ? I want to do it tomorrow but would like a review [06:12:35] looking [06:14:50] XioNoX: LGTM [06:15:26] great, thx [06:15:35] time to sleep! [06:23:24] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10faidon) - esams should be blacklisted for now indeed. 
- `test_nb_inventory_in_librenms` could use some improvement -- it didn't... [06:27:11] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:27:11] RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 4 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:28:23] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:30:13] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:33:33] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:38:11] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active [06:53:57] (03CR) 10Mobrovac: RESTRouter: Add initial Helm chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [06:57:13] RECOVERY - puppet last run on analytics1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:31] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:01:51] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [07:03:57] !log restarting pdfrender on scb1003 [07:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:38] 10Operations, 10netops: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) Did a chown to www-data:librenms: ` elukey@netmon1002:~$ ls -l /var/log/librenms/daily.log* -rw------- 1 www-data librenms 0 May 13 06:25 /var/log/librenms/daily.log 
-rw-r--r-- 1 www... [07:05:33] (03CR) 10Filippo Giunchedi: [C: 03+1] site: remove duplicate node definitions [puppet] - 10https://gerrit.wikimedia.org/r/512952 (owner: 10Cwhite) [07:06:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.594 second response time https://phabricator.wikimedia.org/T174916 [07:08:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/511708 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [07:10:25] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [07:10:31] (03PS1) 10Muehlenhoff: Record extended MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/513028 [07:11:45] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.534 second response time https://phabricator.wikimedia.org/T174916 [07:16:05] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [07:17:12] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/513028 (owner: 10Muehlenhoff) [07:17:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.026 second response time https://phabricator.wikimedia.org/T174916 [07:18:45] (03PS2) 10Jcrespo: mariadb: Disable checks of database snapshots [puppet] - 10https://gerrit.wikimedia.org/r/512894 (https://phabricator.wikimedia.org/T206203) [07:21:47] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [07:22:53] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable checks of database snapshots [puppet] - 10https://gerrit.wikimedia.org/r/512894 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [07:27:59] (03CR) 10Hashar: Add jenkins-agent user to releases-jenkins (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [07:29:07] (03PS1) 10Effie Mouzeli: Remove kafka1018 from ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) [07:30:34] (03PS2) 10Effie Mouzeli: Remove kafka1018 from ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) [07:30:51] (03PS1) 10Hashar: Fix passing ssh key on releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/513034 [07:31:21] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [07:31:55] (03CR) 10Hashar: "I have added the host as an agent already: https://releases-jenkins.wikimedia.org/computer/releases1001.eqiad.wmnet/ ;]" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [07:33:02] thnx jijiki for the restart [07:33:09] 10Operations, 10Traffic: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (10Vgutierrez) [07:33:19] 10Operations, 10Traffic: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (10Vgutierrez) p:05Triage→03Normal [07:34:32] (03PS3) 10Urbanecm: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) [07:34:42] :D [07:34:44] (03CR) 10jerkins-bot: [V: 04-1] Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) (owner: 10Urbanecm) [07:37:09] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [07:37:14] (03PS4) 10Urbanecm: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) [07:37:33] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [07:40:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time https://phabricator.wikimedia.org/T174916 [07:40:54] !log ms-be2043 start sdd rebuild - T222654 [07:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:59] T222654: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 [07:41:03] (03CR) 10Elukey: "It seems good but I have no idea where this config is used now. 
In theory we should be able to remove all the old Kafka Analytics hosts fr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli) [07:42:37] !log decommission restbase1015-b -- T223976 [07:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:42] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [07:46:27] (03CR) 10Hashar: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/186/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [07:48:27] (03PS5) 10Hashar: Rake: honor rubocop AllCops/Excludes [puppet] - 10https://gerrit.wikimedia.org/r/484410 [07:48:47] (03PS6) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [07:48:59] (03PS7) 10Hashar: rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [07:49:15] (03PS7) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [08:00:36] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491564 (owner: 10Hashar) [08:01:46] (03CR) 10jenkins-bot: FlaggedRevisions: Copy in rest of the config, for static registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512053 (owner: 10Reedy) [08:02:45] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:03:07] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:05:04] (03PS2) 10Hashar: zuul: log stack dump to their own file [puppet] - 10https://gerrit.wikimedia.org/r/505253 [08:05:16] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/505253 (owner: 10Hashar) [08:06:57] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [08:07:09] (03PS1) 10Elukey: Introduce profile::analytics::search::data_drop [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) [08:07:45] PROBLEM - MD RAID on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:07:45] PROBLEM - swift-container-server on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [08:07:47] PROBLEM - swift-object-replicator on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [08:07:53] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [08:07:53] PROBLEM - Check size of conntrack table on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [08:07:53] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [08:08:01] PROBLEM - dhclient process on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer [08:08:01] sigh, 
known [08:08:13] PROBLEM - Disk space on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.113: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [08:08:27] silenced [08:08:58] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16793/stat1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [08:09:03] RECOVERY - MD RAID on ms-be2043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:09:03] RECOVERY - swift-container-server on ms-be2043 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [08:09:03] RECOVERY - swift-object-replicator on ms-be2043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [08:09:11] RECOVERY - Check size of conntrack table on ms-be2043 is OK: OK: nf_conntrack is 7 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [08:09:11] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [08:09:19] RECOVERY - dhclient process on ms-be2043 is OK: PROCS OK: 0 processes with command name dhclient [08:09:23] (03CR) 10Elukey: "Ready for the naming/puppet/etc.. bikeshed! 
:D" [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [08:09:29] RECOVERY - Disk space on ms-be2043 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [08:14:29] 10Operations, 10RESTBase-API, 10TechCom, 10serviceops, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac) [08:14:35] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) 05Stalled→03Open [08:14:46] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) Last step: deployment [08:15:21] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 70.04, 78.07, 78.48 https://wikitech.wikimedia.org/wiki/Swift [08:15:42] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [08:18:17] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:31:15] !log draining ganeti2001 for eventual reboot to pick up MDS-enabled kernel [08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:31:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:34] (03PS9) 10Hashar: swift: lower replication interval for beta [puppet] - 
10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) [08:32:55] (03CR) 10jerkins-bot: [V: 04-1] swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [08:33:07] (03CR) 10Hashar: "I have to fix conflict with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/507386/ "swift: hiera-ize object-replicator concurrency" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [08:34:40] (03CR) 10Volans: "A question and documentation nits inline; looks good otherwise." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [08:50:38] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Joe) Ok, we got different takeaways from that ticket (that I did read in the past). Let'... 
[08:51:13] !log draining ganeti2002 for eventual reboot to pick up MDS-enabled kernel [08:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:31] (03CR) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [08:56:07] (03PS1) 10Hashar: swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) [08:56:09] (03PS1) 10Hashar: beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) [08:56:52] (03CR) 10Hashar: "the reason is to change the interval on deployment-prep / beta cluster which is done via: https://gerrit.wikimedia.org/r/#/c/operations/p" [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [08:57:09] (03CR) 10jerkins-bot: [V: 04-1] swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [09:00:25] (03PS2) 10Hashar: swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) [09:00:27] (03PS2) 10Hashar: beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) [09:09:17] (03CR) 10Volans: Add a Spicerack cook book to reboot hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:09:55] (03PS1) 10Jcrespo: mariadb: Depool db2037 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513057 [09:10:35] (03CR) 10Volans: Add a Spicerack cook book to reboot hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:13:25] (03PS1) 10Hashar: swift: hiera-ize object server number of workers 
[puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) [09:13:27] (03PS1) 10Hashar: beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) [09:17:04] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudstore (backups) - https://phabricator.wikimedia.org/T224528 (10aborrero) p:05Triage→03Normal a:05Andrew→03Papaul Rack proposal: anywhere in codfw, each server in a different rack, a rack with 10G support Wiring configuration: single... [09:18:18] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [09:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:08] (03CR) 10Jcrespo: "Questions:" [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:22:33] <_joe_> we have multiple servers with hhvm alarms [09:22:41] <_joe_> is anyone restarting them? [09:23:08] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org - https://phabricator.wikimedia.org/T224528 (10aborrero) [09:24:09] (03PS1) 10Hashar: swift: hierarize container_replicator settings [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) [09:24:11] (03PS1) 10Hashar: beta: slow down swift container replication [puppet] - 10https://gerrit.wikimedia.org/r/513063 (https://phabricator.wikimedia.org/T160990) [09:24:12] I don't see them anymore, I sometimes see soft alarms on mw servers? [09:25:32] (03Abandoned) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [09:26:59] (03CR) 10Muehlenhoff: "> no check for an arbitrary large number of hosts by mistake, for example '*'?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:28:01] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) OK, so the vendor "bounced the interface" and the eqiad<->eqord traffic has been restored. What they noticed -and I confirmed- is that this interface was not carryin... [09:31:48] (03CR) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:32:01] (03CR) 10Jcrespo: "> Yeah, this isn't a generic cook book which handles arbitrary reboots, earlier revisions had some support for depooling but the underlyin" [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:32:36] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) [09:34:14] (03CR) 10Muehlenhoff: "@jcrespo: Ack, I'll add some documentation to that extent." [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:35:15] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10faidon) a:03ayounsi [09:39:00] (03PS1) 10Giuseppe Lavagetto: Fix for jessie: depend on the right gevent version. [software/service-checker] (jessie) - 10https://gerrit.wikimedia.org/r/513067 [09:40:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix for jessie: depend on the right gevent version. 
[software/service-checker] (jessie) - 10https://gerrit.wikimedia.org/r/513067 (owner: 10Giuseppe Lavagetto) [09:42:48] (03PS1) 10DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) [09:42:51] (03CR) 10Volans: "> Patch Set 5:" [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [09:45:59] <_joe_> !log uploading a new service-checker version to jessie-wikimedia [09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:09] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:49:04] 10Operations: Migrate URL downloaders to Stretch/Buster - https://phabricator.wikimedia.org/T224551 (10MoritzMuehlenhoff) [09:49:16] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561 (10Ottomata) [09:51:26] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [09:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:19] (03CR) 10Reedy: [C: 04-2] "Yay!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [09:55:54] (03CR) 10Reedy: [C: 04-2] [cirrus] Load cirrus using wfLoadExtension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [09:57:56] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) [09:58:16] (03CR) 10Jcrespo: [C: 03+1] db-eqiad,db-codfw.php: Remove db2037 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513021 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [09:59:27] (03PS2) 10Jcrespo: mariadb: Depool db2087 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513057 [10:01:30] (03CR) 10Faidon Liambotis: [C: 04-1] "Some additional code changes in addition to the wider policy-related comments on the task :)" (0313 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [10:18:09] (03CR) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:18:24] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) p:05Triage→03Normal [10:21:23] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:21:41] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:21:43] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10MoritzMuehlenhoff) [10:24:44] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494 (10Krenair) {T224549} [10:29:15] (03CR) 10Volans: [C: 04-1] Add the LBRemoteCluster class. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:29:49] !log mobrovac@deploy1001 Started deploy [restbase/deploy@92591a7] (dev-cluster): Switch to OpenAPI v3 [10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:33] (03CR) 10Jbond: [C: 03+1] "LGTM, i missed one autorequire but it harmless" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [10:31:14] 10Operations: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) [10:31:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:32:53] (03Restored) 10Pmiazga: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray) [10:33:18] 10Operations: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 
(10MoritzMuehlenhoff) [10:33:24] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@92591a7] (dev-cluster): Switch to OpenAPI v3 (duration: 03m 36s) [10:33:27] (03PS3) 10Pmiazga: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray) [10:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:40] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:33:51] (03PS1) 10Vgutierrez: redirects.dat: Get rid of Apache specific variables [puppet] - 10https://gerrit.wikimedia.org/r/513077 (https://phabricator.wikimedia.org/T224539) [10:34:15] (03CR) 10jerkins-bot: [V: 04-1] Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (owner: 10Nray) [10:36:09] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:36:48] (03PS2) 10Jbond: firewall loggin: enable firewall logging on analytics servers [puppet] - 10https://gerrit.wikimedia.org/r/511702 [10:37:32] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:37:45] (03PS4) 10Pmiazga: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (https://phabricator.wikimedia.org/T223883) (owner: 10Nray) [10:37:53] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:37:57] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10MoritzMuehlenhoff) [10:38:34] 10Operations, 10serviceops, 10Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - 
https://phabricator.wikimedia.org/T224247 (10MoritzMuehlenhoff) [10:38:36] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:38:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:48:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:48:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:51] (03CR) 10Jbond: [C: 03+2] firewall loggin: enable firewall logging on analytics servers [puppet] - 10https://gerrit.wikimedia.org/r/511702 (owner: 10Jbond) [10:51:03] PROBLEM - Host kubetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:06] akosiaris, _joe_ ^^^ [10:52:15] yup, known [10:52:17] ganeti reboots [10:52:25] that seems serious ... ah, ok [10:52:32] etcd is no longer HA per instance [10:52:37] ah, ack, I thought we were vacating those [10:52:40] Actually, lemme openclose a task [10:52:46] for posterity's sake [10:52:53] would anybody mind if I slipped in a mysql depool before o'clock? 
[10:53:12] (thinking mostly on mw deployers et al) [10:53:59] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db2087 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513057 (owner: 10Jcrespo) [10:54:11] brace for impact then^ [10:55:30] (03Merged) 10jenkins-bot: mariadb: Depool db2087 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513057 (owner: 10Jcrespo) [10:55:33] RECOVERY - Host kubetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [10:56:53] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:09] !log deleteBatch.php for srwikinews finished (T212346) [10:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:14] T212346: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 [10:57:23] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:39] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2087 for maintenance (duration: 01m 11s) [10:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:07] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [10:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1 and Lucas_WMDE: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1100). [11:00:04] Zoranzoki21, Urbanecm, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:08] o/ [11:00:19] o/ [11:00:24] I'll swat my and Zoranzoki's patches [11:00:35] I'm around if you need help [11:00:38] thanks zeljkof [11:00:48] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [11:01:07] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:01:14] Urbanecm, let me know once you're done, I can SWAT my change by myself [11:01:22] will do raynor [11:01:23] 10Operations: Migrate etcd ganeti VMs to plain disk template - https://phabricator.wikimedia.org/T224556 (10akosiaris) [11:01:41] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:01:50] 10Operations: Migrate etcd ganeti VMs to plain disk template - https://phabricator.wikimedia.org/T224556 (10akosiaris) [11:01:59] o/ [11:02:21] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:02:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:46] (03PS3) 10Urbanecm: Add namespace aliases on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [11:02:58] 10Operations: Migrate etcd ganeti VMs to plain disk template - https://phabricator.wikimedia.org/T224556 (10akosiaris) 05Open→03Resolved a:03akosiaris https://wikitech.wikimedia.org/wiki/Ganeti#VMs_without_DRBD_disk_template has been added to address the drawback needing to be communicated and documented.... 
[11:03:04] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [11:04:00] 10Operations: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557 (10MoritzMuehlenhoff) [11:04:20] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:04:48] (03Merged) 10jenkins-bot: Add namespace aliases on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [11:05:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:06:03] deployed to mwdebug, testing there [11:07:05] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero) [11:07:28] syncing... 
[11:08:03] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [11:08:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:506892|Add namespace aliases on zhwiktionary]] (T222024) (duration: 00m 57s) [11:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:27] T222024: Add namespace aliases on zhwiktionary - https://phabricator.wikimedia.org/T222024 [11:09:47] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:09:47] PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:09:59] ^ onimisionipe [11:09:59] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:10:20] gehel: I was just modifying that downtime [11:11:08] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:11:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:11:14] (03PS2) 10Urbanecm: Fix Serbian projects' wgRestrictionLevels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) [11:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "jenkins succeeded previously, overriding to save time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [11:12:18] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) 
[11:12:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:12:59] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:13:07] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:13:56] syncing 512487 [11:14:07] (03PS2) 10Urbanecm: Remove bureaucrat protection level for all Serbian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) [11:14:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [11:14:44] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:512487|Fix Serbian projects wgRestrictionLevels]] (T217005) (duration: 00m 57s) [11:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:49] T217005: Fix wgRestrictionLevels for all Serbian projects to fully work - https://phabricator.wikimedia.org/T217005 [11:15:49] (03Merged) 10jenkins-bot: Remove bureaucrat protection level for all Serbian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [11:18:03] deploying 512488... 
[11:18:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:512488|Remove bureaucrat protection level for all Serbian projects]] (T217005) (duration: 00m 57s) [11:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] raynor, everything should be deployed, leaving over to you [11:19:27] thx [11:19:39] that was fast Urbanecm [11:19:51] :) [11:20:00] (03PS5) 10Pmiazga: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (https://phabricator.wikimedia.org/T223883) (owner: 10Nray) [11:20:19] (03CR) 10Pmiazga: [C: 03+2] Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (https://phabricator.wikimedia.org/T223883) (owner: 10Nray) [11:21:33] raynor, if there will be some space after you finish, please give SWAT back to me, have a backlog of patches to sync, so I can sync some in the spare time :) [11:21:34] (03Merged) 10jenkins-bot: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (https://phabricator.wikimedia.org/T223883) (owner: 10Nray) [11:21:46] sure, mine should be pretty fast [11:21:52] ok [11:22:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.662e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [11:23:29] 10Operations, 10serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10MoritzMuehlenhoff) [11:23:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:25:39] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:03] 10Operations, 10serviceops: Migrate Zookeeper/etcd conf 
cluster in codfw to Stretch - https://phabricator.wikimedia.org/T224560 (10MoritzMuehlenhoff) [11:26:15] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:28:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:28:35] (03PS1) 10Arturo Borrero Gonzalez: ldap client: sssd: introduce jessie-specific sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/513091 (https://phabricator.wikimedia.org/T224558) [11:28:59] RECOVERY - Host ms-be2036 is UP: PING WARNING - Packet loss = 66%, RTA = 36.59 ms [11:29:13] 10Operations, 10serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10Volans) +1 on the naming and +1 on buster, they just have firewall rules, so should be pretty straightforward and easy to do. [11:29:46] syncing 509130 [11:29:52] Urbanecm, I'm almost done [11:29:57] ack raynor, thx [11:30:33] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:509130 Enable Advanced Mobile Contributions Overflow menu (T223883)]] (duration: 00m 57s) [11:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:40] T223883: Deploy AMC overflow menu to wikis with AMC enabled - https://phabricator.wikimedia.org/T223883 [11:31:06] 10Operations, 10cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (10MoritzMuehlenhoff) [11:31:22] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:31:23] Urbanecm - deployed, over to you [11:31:28] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:31:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:31:30] thanks, 
continuing [11:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:41] zeljkof - I'm done, but looks like Urbanecm wants to do sth more [11:31:49] (03CR) 10Urbanecm: [C: 03+2] Enable transwiki import between sqwiki and sqwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) (owner: 10Urbanecm) [11:31:57] (03PS2) 10Urbanecm: Enable transwiki import between sqwiki and sqwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) [11:31:57] I'm not closing the swat window, Urbanecm please close it once you're done [11:32:03] will do raynor [11:32:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) (owner: 10Urbanecm) [11:32:34] raynor, Urbanecm: there's still plenty of time [11:32:49] 10Operations, 10Traffic, 10serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10Volans) [11:32:52] yeah :) [11:33:13] (03Merged) 10jenkins-bot: Enable transwiki import between sqwiki and sqwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) (owner: 10Urbanecm) [11:33:36] (03CR) 10Alex Monk: [C: 04-1] "From a quick look at the diff, this should be a couple of inline if statements in the template, not a separate file." 
[puppet] - 10https://gerrit.wikimedia.org/r/513091 (https://phabricator.wikimedia.org/T224558) (owner: 10Arturo Borrero Gonzalez) [11:34:46] (03PS4) 10Urbanecm: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) [11:34:50] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10MoritzMuehlenhoff) [11:34:54] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [11:35:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:512478|Enable transwiki import between sqwiki and sqwikiquote]] (T221234) (duration: 00m 56s) [11:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:10] T221234: Transwiki Export/Import sqwiki - sqwikiquote back and forth - https://phabricator.wikimedia.org/T221234 [11:35:15] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10MoritzMuehlenhoff) [11:35:17] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:35:28] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:35:55] (03Merged) 10jenkins-bot: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [11:38:18] 10Operations: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10MoritzMuehlenhoff) [11:38:31] (03PS4) 10Urbanecm: Remove uploader user group from fawiki and merge it with autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) [11:38:37] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) (owner: 10Urbanecm) [11:38:38] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:38:47] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: [[:gerrit:512433|Add HD logo for angwikibooks]], logo files (T150618) (duration: 00m 56s) [11:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:52] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [11:39:40] (03Merged) 10jenkins-bot: Remove uploader user group from fawiki and merge it with autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) (owner: 10Urbanecm) [11:40:07] !log Purged angwikibooks HD logos [11:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:42:15] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [11:43:23] 10Operations: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) Tentatively planning to move these straight to buster in the next quarter. 
[11:43:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:505228|Remove uploader user group from fawiki and merge it with autoconfirmed]], part 1 (T221441) (duration: 00m 55s) [11:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:33] T221441: Merge "uploader" group into "autoconfirmed" and "confirmed" groups in Fawiki - https://phabricator.wikimedia.org/T221441 [11:44:03] 10Operations: Reimage wezen to Stretch (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10MoritzMuehlenhoff) [11:44:23] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [11:44:52] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:505228|Remove uploader user group from fawiki and merge it with autoconfirmed]], part 2 (T221441) (duration: 00m 55s) [11:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:43] !log Started mwscript emptyUserGroup.php --wiki=fawiki 'uploader' (T221441) [11:45:45] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 25 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [11:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:01] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [11:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:29] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.006 second response time https://phabricator.wikimedia.org/T174916 [11:46:29] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:46:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [11:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:00] (03PS10) 10Urbanecm: RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [11:47:14] (03CR) 10Urbanecm: [C: 03+2] RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [11:47:21] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [11:48:16] (03PS3) 10Andrew Bogott: Make cloudcontrol1004 the primary keystone host [puppet] - 10https://gerrit.wikimedia.org/r/512954 (https://phabricator.wikimedia.org/T221770) [11:48:31] (03Merged) 10jenkins-bot: RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [11:50:22] (03PS2) 10Arturo Borrero Gonzalez: ldap client: sssd: introduce jessie-specific bits in sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/513091 (https://phabricator.wikimedia.org/T224558) [11:50:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:471260|RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site]] (T208458) (duration: 00m 57s) [11:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:57] T208458: Update RSS whitelist for new wmfblog location - https://phabricator.wikimedia.org/T208458 [11:51:24] (03PS2) 10Urbanecm: Set wgLocaltimezone for euwiki to Europe/Berlin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511849 
(https://phabricator.wikimedia.org/T224091) [11:51:33] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511849 (https://phabricator.wikimedia.org/T224091) (owner: 10Urbanecm) [11:51:49] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcontrol1004 the primary keystone host [puppet] - 10https://gerrit.wikimedia.org/r/512954 (https://phabricator.wikimedia.org/T221770) (owner: 10Andrew Bogott) [11:52:37] (03Merged) 10jenkins-bot: Set wgLocaltimezone for euwiki to Europe/Berlin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511849 (https://phabricator.wikimedia.org/T224091) (owner: 10Urbanecm) [11:55:01] jouncebot: next [11:55:01] In 0 hour(s) and 4 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1200) [11:55:09] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:55:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:55:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:511849|Set wgLocaltimezone for euwiki to Europe/Berlin]] (T224091) (duration: 00m 57s) [11:55:54] !log EU SWAT finished, maintenance script emptyUserGroup.php still running in separate tmux session [11:56:37] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [11:57:55] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:57:57] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:57:57] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 264 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [11:58:16] !log T221770 icinga downtime cloudcontrol1003.wikimedia.org for upcoming rebuild 
as stretch [11:58:35] hmm, stashbot's down [11:59:14] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 298 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [11:59:19] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2037 from config as it will be decommissioned T221533 (duration: 00m 56s) [11:59:41] that paged [11:59:46] <_joe_> it did [12:00:00] lucaswerkmeister-wmde@tools-sgebastion-07:/$ become stashbot [12:00:00] indeed [12:00:02] become: no such tool 'stashbot' [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1200) [12:00:18] that could be related to the NFS changes we have been doing lately [12:00:18] Lucas_WMDE, there is some toolforge-wide problem [12:00:21] Caught exception: [Errno 116] Stale file handle: '/data/project/toolschecker/nfs-test/ba146140-aa2f-419c-9ed2-6bc580d49dd5' [12:00:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2037 from config as it will be decommissioned T221533 (duration: 00m 56s) [12:00:22] ok [12:00:25] I even can't access my home [12:00:36] yeah, also got a “stale file handle” error on login [12:01:02] we are investigating [12:01:17] thanks arturo [12:01:54] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:01:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:04] Lucas_WMDE, hmm, will stashbot "catch up"? 
[12:02:22] I don’t see how, it doesn’t even seem to be in the channel anymore [12:02:37] wm-bot is still alive though, so we have logs [12:02:59] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:03:07] worst case, we can do something like !log 2019-05-29 11:55:54 !log EU SWAT finished, maintenance script emptyUserGroup.php still running in separate tmux session [12:03:18] yeah [12:03:19] so the correct timestamp and user is at least in the message [12:03:23] just wondered [12:04:23] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:06:37] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:39] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:09:29] <[1997kB]> getting 500 on *.wmflabs.org [12:09:32] Lucas_WMDE: Urbanecm if you want i can watch for stashbot and log it with the utc timestamp? [12:09:54] [1997kB], yes, known, probably nfs issues, arturo 's investigating [12:10:14] yes, we think we know more or less where the issue is now [12:10:36] <[1997kB]> ah ok.. [12:10:51] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.107 second response time https://phabricator.wikimedia.org/T174916 [12:11:08] Zppix, would be kind :) [12:11:47] Urbanecm: will do, what times do you need logged up to? 
[12:11:54] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:11:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:12:35] 11:50 (UTC) is the last logged message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:42] Zppix, everything that stashbot didn't acknowledge, basically everything after 11:50 UTC [12:12:57] Alright will log them once it returns :) [12:14:21] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:15:09] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:15:15] thanks Zppix [12:15:15] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [12:16:59] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] [12:18:01] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:20:49] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:20:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:21:49] [1997kB], everything under wmflabs.org? o.O [12:21:53] only NFS is known to be having issues [12:23:06] !log Rolling restart pdfrender on scb* [12:23:08] <[1997kB]> probably everything, getting at 3 tools [12:23:59] <[1997kB]> +3 more [12:24:06] [1997kB], so only stuff on tools.wmflabs.org ? 
[12:24:09] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:47] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [12:25:09] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [12:25:11] <[1997kB]> fixed now [12:25:43] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:59] looks so, NFS seems to behave normally [12:26:07] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [12:26:07] ping Lucas_WMDE, can you start stashbot please? 
[12:26:37] (03CR) 10Faidon Liambotis: [C: 04-1] "One minor comment, but also: please paste a test output somewhere (phab, gdoc, whatever) so that we can validate that the output is sensib" (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [12:27:02] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:27:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:27:51] uh, I have no idea if I actually have the rights [12:27:53] let’s see [12:27:56] 10Operations: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (10MoritzMuehlenhoff) [12:28:08] nope [12:28:14] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (10MoritzMuehlenhoff) [12:28:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:28:37] bd808 or greg-g, can one of you restart stashbot? [12:29:04] (hashar and yuvipanda don’t seem to be online, and that’s all the maintainers according to https://toolsadmin.wikimedia.org/tools/id/stashbot) [12:29:43] welcome stashbot! 
[12:29:46] !log [11:55:09] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:55] \o/ [12:30:16] !log [11:55:10] jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:02] !log [11:55:11] urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:511849|Set wgLocaltimezone for euwiki to Europe/Berlin]] (T224091) (duration: 00m 57s) [12:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:06] T224091: Change official time on euwiki to CET/CEST instead of UTC - https://phabricator.wikimedia.org/T224091 [12:31:21] (03PS23) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [12:31:31] !log [11:55:54] EU SWAT finished, maintenance script emptyUserGroup.php still running in separate tmux session [12:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:10] !log [11:57:55] aborrero@cumin1001 START - Cookbook sre.hosts.downtime [12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:39] !log [11:57:57] aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] Zppix: ? what are you logging to the SAL? 
[12:33:09] arturo, missed messages [12:33:15] messages when stashbot wasn't around [12:33:26] ok [12:33:36] !log [11:58:16] T221770 icinga downtime cloudcontrol1003.wikimedia.org for upcoming rebuild as stretch [12:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:41] T221770: Upgrade cloucontrol1003/1004 to stretch/mitaka - https://phabricator.wikimedia.org/T221770 [12:34:10] !log [11:59:19] marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2037 from config as it will be decommissioned T221533 [12:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:15] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [12:34:24] wait, what? [12:34:37] 10Operations, 10Pybal, 10Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10MoritzMuehlenhoff) [12:34:49] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:34:50] the issue is not solved yet Urbanecm we are working on it [12:35:17] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:35:28] !log [12:00:21] marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2037 from config as it will be decommissioned T221533 (duration: 00m 56s) [12:35:39] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:35:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:43] Zppix: you’ll have to repeat that one again, stashbot was gone [12:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:50] Yeah i noticed :D [12:35:54] or wait until arturo says it’s really resolved :) 
[12:35:59] !log [12:00:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2037 from config as it will be decommissioned T221533 (duration: 00m 56s) [12:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:20] Lucas_WMDE: im almost half way through *shrug* [12:36:39] 10Operations: Migrate auth* servers to Stretch/Buster - https://phabricator.wikimedia.org/T224571 (10MoritzMuehlenhoff) [12:36:50] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:36:59] !log [12:01:54] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:34] !log [12:01:54] jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0 [12:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:27] !log [12:11:54] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:54] !log [12:11:55] jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:28] !log [12:20:49] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:51] !log [[12:20:49] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] (I will be going through i made a few errors i will correct onwiki [12:40:36] [12:23:06] Rolling restart pdfrender on scb* [12:40:45] !log [12:23:06] Rolling restart pdfrender on scb* [12:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:14] (03CR) 10Volans: "We're pretty much there, just couple of questions and I think a small 
indentation error, see inline." (035 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [12:41:29] !log [12:27:02] jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:08] !log [12:27:02] jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:38] Sweet all done doing some minor corrections onwiki now Urbanecm Lucas_WMDE [12:45:41] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:48] (03PS1) 10Bstorm: nfs-exportd: if auth errors happen, do not proceed [puppet] - 10https://gerrit.wikimedia.org/r/513105 [12:46:29] RECOVERY - tilerator on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:46:33] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.375 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:47:15] 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10MoritzMuehlenhoff) [12:47:23] (03CR) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki (0326 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [12:47:29] If i missed anything feel free to lmk [12:47:44] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:48:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nfs-exportd: if auth errors happen, do not proceed [puppet] - 10https://gerrit.wikimedia.org/r/513105 (owner: 10Bstorm) [12:48:50] (03CR) 10Andrew Bogott: [C: 03+1] nfs-exportd: if auth errors 
happen, do not proceed [puppet] - 10https://gerrit.wikimedia.org/r/513105 (owner: 10Bstorm) [12:48:58] (03CR) 10Bstorm: [C: 03+2] nfs-exportd: if auth errors happen, do not proceed [puppet] - 10https://gerrit.wikimedia.org/r/513105 (owner: 10Bstorm) [12:56:23] 10Operations, 10serviceops, 10Kubernetes: Migrate Kubernetes etcd clusters to Stretch/Buster - https://phabricator.wikimedia.org/T224574 (10MoritzMuehlenhoff) [12:56:46] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:57:04] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10WMDE-Fisch) [13:00:04] zeljkof: Your horoscope predicts another unfortunate MediaWiki train - European version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1300). [13:00:46] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:25] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:33] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:45] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [13:01:51] thanks Zppix [13:02:16] tx Zppix [13:02:21] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:27] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:02:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:52] o/ [13:02:54] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:01] train is leaving the station [13:03:23] is NFS back alive? I got some errors from cron jobs [13:03:26] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:35] Lucas_WMDE, everything's working, so it should [13:03:42] ok [13:06:53] !log stopping openstack services on cloudcontrol1003 in anticipation of a re-image [13:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:43] 10Operations: Migrate ununpentium/RT to Stretch/Buster - https://phabricator.wikimedia.org/T224575 (10MoritzMuehlenhoff) [13:08:03] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:09:09] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:09:15] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:09:24] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:09:27] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774 (10MoritzMuehlenhoff) [13:10:18] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:12:49] !log mwscript emptyUserGroup.php --wiki=fawiki 'uploader' finished (T221441) [13:12:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:54] T221441: Merge "uploader" group into "autoconfirmed" and "confirmed" groups in Fawiki - https://phabricator.wikimedia.org/T221441 [13:13:20] 10Operations: Upgrade install servers to Stretch/Buster - https://phabricator.wikimedia.org/T224576 (10MoritzMuehlenhoff) [13:13:36] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:14:14] (03PS1) 10Zfilipin: group1 wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513115 [13:14:18] (03CR) 10Zfilipin: [C: 03+2] group1 wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513115 (owner: 10Zfilipin) [13:14:50] 10Operations, 10serviceops, 10Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (10MoritzMuehlenhoff) [13:15:08] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:15:23] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513115 (owner: 10Zfilipin) [13:16:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:16:40] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:16:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:04] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:17:51] (03CR) 10Volans: [C: 04-1] "Couple of missing things and replied to the test one." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [13:18:03] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.7 [13:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:02] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.7 (duration: 00m 58s) [13:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:49] (03PS6) 10Jbond: CI - python: update python type checking to use mime type [puppet] - 10https://gerrit.wikimedia.org/r/510575 (https://phabricator.wikimedia.org/T144169) [13:20:30] (03CR) 10Jbond: [C: 03+2] CI - python: update python type checking to use mime type [puppet] - 10https://gerrit.wikimedia.org/r/510575 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [13:22:33] 10Operations: Migrate irc.wikimedia.org/kraz to Stretch/Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) [13:23:32] 10Operations, 10Wikimedia-Etherpad, 10serviceops: Migrate etherpad1001 to Stretch/Buster - https://phabricator.wikimedia.org/T224580 (10MoritzMuehlenhoff) [13:23:57] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:28:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:28:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/510871 (owner: 10Jbond) [13:31:20] 10Operations, 10cloud-services-team: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10MoritzMuehlenhoff) [13:31:46] (03PS1) 10Andrew Bogott: nova: more swapping of cloudcontrol1003/1004 [puppet] - 
10https://gerrit.wikimedia.org/r/513117 (https://phabricator.wikimedia.org/T221770) [13:32:10] (03CR) 10Jbond: [C: 03+2] admin module: improve CI [puppet] - 10https://gerrit.wikimedia.org/r/510871 (owner: 10Jbond) [13:32:14] 10Operations, 10cloud-services-team: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10MoritzMuehlenhoff) [13:32:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:32:43] (03CR) 10Andrew Bogott: [C: 03+2] nova: more swapping of cloudcontrol1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/513117 (https://phabricator.wikimedia.org/T221770) (owner: 10Andrew Bogott) [13:32:47] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:33:29] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:33:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:34:39] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:35:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:35:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:44] 10Operations, 10cloud-services-team: Migrate labmon* to Stretch - https://phabricator.wikimedia.org/T224585 (10MoritzMuehlenhoff) [13:42:44] 10Operations: 
Migrate fermium to stretch/buster - https://phabricator.wikimedia.org/T224586 (10MoritzMuehlenhoff) [13:43:16] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:48:44] !log decommissioning restbase1015-c -- T223976 [13:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:50] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [13:51:51] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) [13:51:57] (03PS1) 10Alexandros Kosiaris: cssandra::single_instance: Remove thrift ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/513122 [13:51:59] (03PS1) 10Alexandros Kosiaris: Remove utils/new_wmf_service stuff [puppet] - 10https://gerrit.wikimedia.org/r/513123 [13:52:01] (03PS1) 10Alexandros Kosiaris: hiera_lookup: Amend the tool to support ::_role [puppet] - 10https://gerrit.wikimedia.org/r/513124 [13:52:03] (03PS1) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) [13:52:43] 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Stretch/Buster - https://phabricator.wikimedia.org/T224590 (10MoritzMuehlenhoff) [13:53:19] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:54:01] (03PS2) 10Alexandros Kosiaris: cassandra::single_instance: Remove thrift ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/513122 [13:54:03] (03PS2) 10Alexandros Kosiaris: Remove utils/new_wmf_service stuff [puppet] - 10https://gerrit.wikimedia.org/r/513123 [13:54:05] (03PS2) 10Alexandros Kosiaris: hiera_lookup: Amend the tool to support ::_role [puppet] - 10https://gerrit.wikimedia.org/r/513124 [13:54:07] (03PS2) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm [puppet] - 
10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) [13:54:25] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:54:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:32] 10Operations, 10Continuous-Integration-Infrastructure: Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10MoritzMuehlenhoff) [13:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:47] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:56:30] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10Krenair) >>! In T185319#4101252, @Nuria wrote: > One of the analytics engineers. Which analytics engineer, and is the... [13:57:23] (03PS1) 10Bstorm: nfs-exportd: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/513127 [13:57:25] (03PS1) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [13:58:34] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:37] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [13:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Krenair) Is apertium part of the cxserver migration? 
[14:01:31] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:01:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:22] (03CR) 10DCausse: [C: 03+1] Convert cirrus data retention from cron to systemd. [puppet] - 10https://gerrit.wikimedia.org/r/512702 (https://phabricator.wikimedia.org/T224200) (owner: 10Gehel) [14:02:56] (03CR) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. (0314 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [14:03:12] (03PS8) 10Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 [14:03:14] (03PS8) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [14:03:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:06:25] (03PS2) 10Gehel: Convert cirrus data retention from cron to systemd. [puppet] - 10https://gerrit.wikimedia.org/r/512702 (https://phabricator.wikimedia.org/T224200) [14:07:13] (03PS3) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) [14:07:54] (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [14:08:26] (03CR) 10Gehel: [C: 03+2] Convert cirrus data retention from cron to systemd. 
[puppet] - 10https://gerrit.wikimedia.org/r/512702 (https://phabricator.wikimedia.org/T224200) (owner: 10Gehel) [14:09:30] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:09:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:23] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1001/16795/ say LGTM, so I am going to proceed with it" [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [14:12:51] (03PS3) 10Alexandros Kosiaris: Remove utils/new_wmf_service stuff [puppet] - 10https://gerrit.wikimedia.org/r/513123 [14:13:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove utils/new_wmf_service stuff [puppet] - 10https://gerrit.wikimedia.org/r/513123 (owner: 10Alexandros Kosiaris) [14:14:04] (03PS3) 10Alexandros Kosiaris: hiera_lookup: Amend the tool to support ::_role [puppet] - 10https://gerrit.wikimedia.org/r/513124 [14:14:22] (03PS4) 10Alexandros Kosiaris: hiera_lookup: Amend the tool to support ::_role [puppet] - 10https://gerrit.wikimedia.org/r/513124 [14:14:49] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] hiera_lookup: Amend the tool to support ::_role [puppet] - 10https://gerrit.wikimedia.org/r/513124 (owner: 10Alexandros Kosiaris) [14:14:55] <_joe_> akosiaris: uhm you can also remove the other part [14:14:57] <_joe_> heh [14:15:02] <_joe_> ok I'll do it later [14:17:23] (03PS9) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [14:18:48] (03PS2) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [14:20:03] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:20:04] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:48] (03CR) 10Mobrovac: "One comment, otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [14:23:00] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) This should be easy unless apache changes are drastic like last time (as this are rather standard web frontend hosts). [14:23:02] (03PS3) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [14:24:18] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:25:34] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) Apache should be harmless, it's just different versions of Apache 2.4, but I vaguely remember an issue with something requiring PHP5. But I might be completely off track here, it's just a vag... 
[14:26:40] (03CR) 10Volans: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [14:29:38] <_joe_> !log installing new service checker version on restbase in codfw [14:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:45] <_joe_> !log installing the new service checker on restbase in eqiad [14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:00] (03Merged) 10jenkins-bot: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [14:32:12] (03CR) 10jenkins-bot: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [14:32:15] (03PS1) 10Andrew Bogott: cloudcontrol1003: install with Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513134 (https://phabricator.wikimedia.org/T221770) [14:32:39] !log powering off cloudcontrol1003 as one last check to see what explodes before I reimage it [14:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] (03PS2) 10Andrew Bogott: cloudcontrol1003: install with Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513134 (https://phabricator.wikimedia.org/T221770) [14:34:07] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:34:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:13] !log draining ganeti2006 for eventual reboot to pick up MDS-enabled kernel [14:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:27] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol1003: install with Debian Stretch [puppet] - 
10https://gerrit.wikimedia.org/r/513134 (https://phabricator.wikimedia.org/T221770) (owner: 10Andrew Bogott) [14:36:57] (03CR) 10Cwhite: [C: 03+2] site: remove duplicate node definitions [puppet] - 10https://gerrit.wikimedia.org/r/512952 (owner: 10Cwhite) [14:37:04] (03PS2) 10Cwhite: site: remove duplicate node definitions [puppet] - 10https://gerrit.wikimedia.org/r/512952 [14:38:51] (03PS4) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [14:39:12] (03PS14) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [14:39:14] (03PS7) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [14:39:16] (03PS7) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [14:40:00] (03CR) 10Ayounsi: "Great, thanks!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [14:40:08] (03CR) 10jerkins-bot: [V: 04-1] Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [14:43:11] (03PS3) 10Alexandros Kosiaris: cassandra::single_instance: Remove thrift ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/513122 [14:43:13] (03PS4) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) [14:43:17] (03CR) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [14:44:00] (03PS15) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [14:44:02] (03PS8) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [14:44:04] (03PS8) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [14:45:32] !log reimaging cloudcontrol1003 T221770 [14:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:37] T221770: Upgrade cloucontrol1003/1004 to stretch/mitaka - https://phabricator.wikimedia.org/T221770 [14:47:29] !log disable et- interfaces on cr1-codfw - T224511 [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:34] T224511: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 [14:48:14] !log `request chassis fpc offline slot 0` on cr1-codfw - T224511 [14:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:21] 
(03CR) 10Vgutierrez: Puppet, add RPKI validation daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [14:51:46] !log `request chassis fpc online slot 0` on cr1-codfw - T224511 [14:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:30] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.458e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:53:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16797/ says validation of IPs worked fine (for both IPv4 AND IPv6) so great!" [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [14:53:42] (03PS5) 10Alexandros Kosiaris: cassandra: Support client IPs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/513125 (https://phabricator.wikimedia.org/T220401) [14:54:02] !log draining ganeti2007 for eventual reboot to pick up MDS-enabled kernel [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:17] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:54:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:09] (03CR) 10Gehel: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 
10Mathew.onipe) [15:02:14] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-new/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513139 (https://phabricator.wikimedia.org/T188327) [15:03:48] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513139 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:04:52] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513139 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:05:11] (03PS16) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [15:05:13] (03PS9) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [15:05:15] (03PS9) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [15:05:36] PROBLEM - Host kubetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:56] ^ related to the ganeti reboot, should be back up shortly [15:06:12] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-new/read-new on group 1 (T188327) (duration: 00m 57s) [15:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:18] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [15:06:40] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:06:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:09] (03PS2) 10Cwhite: role: remove prometheus 
backwards-compatibility rules [puppet] - 10https://gerrit.wikimedia.org/r/511734 (https://phabricator.wikimedia.org/T219825) [15:08:52] PROBLEM - PHP7 rendering on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 31002 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:10:02] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 870 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:10:30] RECOVERY - Host kubetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [15:10:48] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:11:28] !log draining ganeti2008 for eventual reboot to pick up MDS-enabled kernel [15:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:14] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:12:16] (03CR) 10Ayounsi: Puppet, add RPKI validation daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [15:12:19] (03CR) 10Ayounsi: [C: 03+2] Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [15:12:27] (03PS2) 10Gehel: Update cloudelastic storage device to dm-0, matching reality [puppet] - 10https://gerrit.wikimedia.org/r/512994 (owner: 10EBernhardson) [15:13:11] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Juniper alarm active [15:13:20] (03PS2) 10Bstorm: nfs-exportd: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/513127 [15:13:40] (03PS5) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [15:13:52] (03CR) 10Gehel: [C: 03+2] Update cloudelastic storage device to dm-0, matching reality [puppet] - 10https://gerrit.wikimedia.org/r/512994 (owner: 10EBernhardson) [15:15:58] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10herron) I do think that we would need to be consistent about what constitutes "Acknowledged" (or a similar column name). IMO the workboard transition action would indicate that the clini... 
[15:16:43] (03CR) 10Bstorm: [C: 03+2] nfs-exportd: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/513127 (owner: 10Bstorm) [15:16:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:16:53] (03PS3) 10Bstorm: nfs-exportd: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/513127 [15:17:54] (03PS17) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [15:18:42] (03PS1) 10Vgutierrez: redirects.dat: Get rid of www.*.wikipedia.[com,net,info] [puppet] - 10https://gerrit.wikimedia.org/r/513141 (https://phabricator.wikimedia.org/T224539) [15:20:15] (03PS6) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [15:21:53] (03CR) 10Arturo Borrero Gonzalez: "LGTM, minor comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513128 (owner: 10Bstorm) [15:22:25] (03PS18) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [15:22:47] (03CR) 10Jforrester: "Whee. Thank you for working on this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [15:23:41] (03CR) 10BBlack: [C: 03+1] redirects.dat: Get rid of Apache specific variables [puppet] - 10https://gerrit.wikimedia.org/r/513077 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [15:23:43] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [15:25:07] (03CR) 10BBlack: [C: 03+1] redirects.dat: Get rid of www.*.wikipedia.[com,net,info] [puppet] - 10https://gerrit.wikimedia.org/r/513141 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [15:29:27] (03PS1) 10Vgutierrez: redirects.dat: Ban using .*. [puppet] - 10https://gerrit.wikimedia.org/r/513142 (https://phabricator.wikimedia.org/T133548) [15:31:06] (03PS1) 10BBlack: cache: reimage cp3044 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513143 (https://phabricator.wikimedia.org/T222937) [15:31:08] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org - https://phabricator.wikimedia.org/T224528 (10Papaul) [15:35:47] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org - https://phabricator.wikimedia.org/T224528 (10Papaul) [15:38:56] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:39:12] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:39:18] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:40:20] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:40:38] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 77026 bytes in 0.222 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:40:40] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:45:17] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10KartikMistry) @BBlack Can you please review https://gerrit.wikimedia.org/r/506043 ? [15:48:49] (03CR) 10BBlack: [C: 03+1] Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [15:49:22] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10BBlack) Done. Are we ready to deploy it already or blocked on other MW-level deploys still? 
[15:50:07] (03PS1) 10CRusnov: profile::netbox: Fix user that runs netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/513146 [15:53:22] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 60.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:55:57] !log upgrade and restart db2087 [15:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] 10Operations, 10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10Mholloway) [15:56:31] (03PS3) 10Sbisson: Revert "Hardcode korean help desk config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 [15:56:47] 10Operations, 10Operations-Software-Development, 10netbox, 10Patch-For-Review: Netbox: cable termination names report - https://phabricator.wikimedia.org/T216469 (10crusnov) Sample output: ` test_console_port_termination_names 2019-05-28T22:36:24.654134+00:00 Success 211 correctly named console port cabl... 
[15:56:59] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Krinkle) [15:57:01] 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) [15:57:08] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [15:57:12] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Krinkle) [15:57:14] 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) [15:57:24] PROBLEM - Host kubetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:27] 10Operations, 10PHP 7.2 support, 10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) 05duplicate→03Resolved [15:57:33] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [15:57:50] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [15:58:49] (03CR) 10Volans: [C: 04-1] Netbox module for Spicerack (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected 
human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1600). Please do the needful. [16:00:04] stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:18] that's right [16:00:25] I'll SWAT [16:00:30] RECOVERY - Host kubetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [16:01:13] stephanebisson: note [wmf.6] 512950 Revert "Fix phan job: ignore line using JsonSerializable" [16:01:29] it is a noop for production afaik ;) [16:01:47] and thank you to have followed up on that GrowthExperiment issue from last week \o/ [16:01:49] hashar: I know but the other patch if failing without it [16:01:55] :-( [16:03:08] (03PS24) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [16:03:17] (03CR) 10CRusnov: "Thanks as always :)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:05:35] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 [16:05:41] (03PS1) 10Andrew Bogott: rabbitmq: open firewalls for rabbit communication both ways [puppet] - 10https://gerrit.wikimedia.org/r/513150 (https://phabricator.wikimedia.org/T223906) [16:06:02] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 [16:08:09] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: open firewalls for rabbit communication both ways [puppet] - 10https://gerrit.wikimedia.org/r/513150 (https://phabricator.wikimedia.org/T223906) (owner: 10Andrew Bogott) [16:08:22] (03CR) 10Jcrespo: [C: 03+1] "Can go any time now" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/513149 (owner: 10Jcrespo) [16:08:54] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [16:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:36] stephanebisson, ping me once you're done, have a patch to deploy [16:09:48] Urbanecm: ok [16:13:17] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) p:05Triage→03Normal [16:15:23] (03PS3) 10Jcrespo: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 [16:15:25] (03PS1) 10Jcrespo: mariadb: Depool db1089 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513152 [16:16:34] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 hour ago with 2 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] [16:17:11] 10Operations, 10MediaWiki-extensions-PdfHandler, 10Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10MarkAHershberger) >>! In T50007#5208406, @Schtom wrote: > i ran > ` > convert pdffile.pdf test.jp... [16:24:19] (03CR) 10Dzahn: [C: 03+1] redirects.dat: Get rid of Apache specific variables [puppet] - 10https://gerrit.wikimedia.org/r/513077 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [16:27:07] PROBLEM - puppet last run on rpki1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup] [16:28:48] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) [16:31:31] (03PS1) 10Ayounsi: Routinator, make routinator user system [puppet] - 10https://gerrit.wikimedia.org/r/513155 [16:32:03] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org - https://phabricator.wikimedia.org/T224528 (10aborrero) On second thoughts, we would like to change the public VLAN for a private one, from `.wikimedia.org` to `.wmnet`. [16:32:06] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org - https://phabricator.wikimedia.org/T224528 (10Bstorm) @Papaul is it too late to suggest switching to cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? The team has been... [16:32:37] !log sbisson@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/GrowthExperiments/includes/HelpPanel/QuestionRecord.php: SWAT: [[gerrit:512950]] Revert: Fix phan job: ignore line using JsonSerializable (duration: 00m 57s) [16:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:13] RECOVERY - PHP7 rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 77073 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:33:58] (03PS3) 10Arturo Borrero Gonzalez: ldap client: sssd: introduce jessie-specific bits in sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/513091 (https://phabricator.wikimedia.org/T224558) [16:34:03] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [16:34:45] hey, I have an error on jenskins ci - https://phabricator.wikimedia.org/T224605 - who is the best person to ask 
what to do with it? [16:35:20] it's related to wikibase ( Failed to map interlanguage prefix es to a global site ID. [Called from Wikibase\Client\LangLinkHandler::localLinksToArray in /workspace/src/extensions/Wikibase/client/includes/LangLinkHandler.php at line 299] in /workspace/src/includes/debug/MWDebug.php) [16:35:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16799/" [puppet] - 10https://gerrit.wikimedia.org/r/513091 (https://phabricator.wikimedia.org/T224558) (owner: 10Arturo Borrero Gonzalez) [16:36:12] raynor: I just tagged it for the germans [16:36:18] Reedy, thanks, I just noticed that [16:36:46] Is it failing on an unrelated commit? [16:37:10] I need to write down how to tag wikibase projects, I never know how to tag, and the phab autocomplete shows nothing helpful when I type wikibase [16:37:23] yea, it's failing on two different commits on Minerva skin [16:37:26] totally unrelated [16:37:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:37:58] for your reference... wikidata-campsite is a good one to tag as "this needs attention of wikidata people" [16:38:03] Especially for broken stuff [16:38:26] awesome, good to know, added to my notes [16:38:57] ACKNOWLEDGEMENT - puppet last run on rpki1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup] Ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/513155 [16:39:49] RECOVERY - PHP7 rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 77075 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:39:58] my understand that wikidata-campsite is for us to assign… but I’m not sure [16:40:05] my *understanding was [16:40:22] Well, you're welcome to unassign it... [16:40:30] But no one has told me off for using it for broken stuff etc [16:40:41] I don't tend to add it for "can you add this feature" [16:42:45] !log sbisson@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/GrowthExperiments/includes/HelpPanel.php: SWAT: [[gerrit:512940]] Prevent parsing of GEHelpPanelHelpDeskTitle from accessing the session (duration: 01m 00s) [16:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:33] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 (owner: 10Sbisson) [16:44:51] 10Operations, 10MediaWiki-extensions-PdfHandler, 10Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10Schtom) yes i did. i didn't recognize anything wrong with the image file. i used gnome's image vie... 
[16:45:41] (03Merged) 10jenkins-bot: Revert "Hardcode korean help desk config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 (owner: 10Sbisson) [16:45:43] !log sbisson@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/GrowthExperiments/includes/HelpPanel.php: SWAT: [[gerrit:512941]] Prevent parsing of GEHelpPanelHelpDeskTitle from accessing the session (duration: 00m 56s) [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:37] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10aborrero) [16:48:18] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:512942]] Revert: Hardcode korean help desk config (duration: 00m 56s) [16:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:36] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) console information: scs-a1-codwf port 40 [16:48:39] Urbanecm: I'm done [16:48:47] stephanebisson, thanks [16:49:41] (03PS5) 10Urbanecm: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) [16:50:03] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) (owner: 10Urbanecm) [16:51:10] (03Merged) 10jenkins-bot: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) (owner: 10Urbanecm) [16:53:17] RECOVERY - puppet last run on ms-be2040 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:55:59] (03PS1) 10Andrew Bogott: Revert "nova: more swapping of cloudcontrol1003/1004" [puppet] - 10https://gerrit.wikimedia.org/r/513161 [16:56:19] (03PS1) 
10Andrew Bogott: Revert "Make cloudcontrol1004 the primary keystone host" [puppet] - 10https://gerrit.wikimedia.org/r/513162 [16:56:36] (03PS2) 10Andrew Bogott: Revert "nova: more swapping of cloudcontrol1003/1004" [puppet] - 10https://gerrit.wikimedia.org/r/513161 [16:58:12] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova: more swapping of cloudcontrol1003/1004" [puppet] - 10https://gerrit.wikimedia.org/r/513161 (owner: 10Andrew Bogott) [16:58:29] (03PS2) 10Andrew Bogott: Revert "Make cloudcontrol1004 the primary keystone host" [puppet] - 10https://gerrit.wikimedia.org/r/513162 [16:59:06] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Make cloudcontrol1004 the primary keystone host" [puppet] - 10https://gerrit.wikimedia.org/r/513162 (owner: 10Andrew Bogott) [16:59:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:501926|Change arwiki default user preferences]], part 1/3 (T220186) (duration: 00m 56s) [16:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:41] T220186: Change numerous Arabic Wikipedia Default User Preferences - https://phabricator.wikimedia.org/T220186 [17:00:45] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: [[:gerrit:501926|Change arwiki default user preferences]], part 2/3 (T220186) (duration: 00m 56s) [17:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513155 (owner: 10Ayounsi) [17:02:03] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: [[:gerrit:501926|Change arwiki default user preferences]], part 3/3 (T220186) (duration: 00m 56s) [17:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:14] (03CR) 10Ayounsi: [C: 03+2] Routinator, make routinator user system [puppet] - 10https://gerrit.wikimedia.org/r/513155 (owner: 10Ayounsi) [17:02:23] (03PS2) 10Ayounsi: Routinator, make routinator 
user system [puppet] - 10https://gerrit.wikimedia.org/r/513155 [17:06:54] (03CR) 10Cwhite: [C: 03+2] role: remove prometheus backwards-compatibility rules [puppet] - 10https://gerrit.wikimedia.org/r/511734 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite) [17:07:03] (03PS3) 10Cwhite: role: remove prometheus backwards-compatibility rules [puppet] - 10https://gerrit.wikimedia.org/r/511734 (https://phabricator.wikimedia.org/T219825) [17:09:01] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10aborrero) On IRC: ` 18:56 arturo: i have RAID setup that is unknow for now please discuss with your team and provide me with the... [17:09:55] RECOVERY - puppet last run on rpki1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:13:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 1 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [17:13:57] !log enable cr1-codfw:et-0/0/0 - T224511 [17:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:03] T224511: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 [17:14:34] (03CR) 10Ayounsi: [C: 03+2] Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [17:16:24] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513146 (owner: 10CRusnov) [17:17:35] (03PS10) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [17:17:45] (03CR) 10Volans: [C: 03+1] "Ready to be merged!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:24:04] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) ` $ ssh labweb1001.wikimedia.org $ sql labswiki Fatal error: Uncaught RuntimeException: RedisConnectionPool requires a Redis client... [17:31:33] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:53] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:33:23] (03CR) 10CRusnov: [C: 03+2] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:34:11] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:39:29] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:42:15] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:43:58] (03PS1) 10Catrope: MWScript.php: Mark refreshMessageBlobs.php as a global script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513174 (https://phabricator.wikimedia.org/T222539) [17:44:23] !log enable cr1-codfw:et-0/0/1 - 
T224511 [17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:29] T224511: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 [17:45:19] (03CR) 10Bstorm: nfs-exportd: get essential openstack information from yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513128 (owner: 10Bstorm) [17:46:29] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:46:44] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) >>! In T223393#5221742, @bd808 wrote: > I'm going to poke around in puppet a bit and try to figure out what manifest we are missing... [17:49:11] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:49:23] (03PS25) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [17:50:10] (03PS1) 10Andrew Bogott: dns_floating_ip_updater: fix some broken rename changes [puppet] - 10https://gerrit.wikimedia.org/r/513176 [17:50:22] (03CR) 10Bstorm: nfs-exportd: get essential openstack information from yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513128 (owner: 10Bstorm) [17:50:43] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:51:01] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Fix user that runs netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/513146 (owner: 10CRusnov) [17:51:23] (03CR) 10Andrew Bogott: 
[C: 03+2] dns_floating_ip_updater: fix some broken rename changes [puppet] - 10https://gerrit.wikimedia.org/r/513176 (owner: 10Andrew Bogott) [17:51:39] (03PS2) 10CRusnov: profile::netbox: Fix user that runs netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/513146 [17:53:23] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:53:58] (03PS7) 10Bstorm: nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 [17:54:23] 10Operations, 10Operations-Software-Development, 10netbox, 10Patch-For-Review: Netbox: cable termination names report - https://phabricator.wikimedia.org/T216469 (10crusnov) After a discussion with Faidon, I think the general consensus is that DRAC (and ILO) should be an acceptable termination name for man... [17:55:10] (03CR) 10Bstorm: [C: 03+2] nfs-exportd: get essential openstack information from yaml files [puppet] - 10https://gerrit.wikimedia.org/r/513128 (owner: 10Bstorm) [17:55:33] (03CR) 10Arturo Borrero Gonzalez: nfs-exportd: get essential openstack information from yaml files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513128 (owner: 10Bstorm) [17:55:35] (03PS3) 10CRusnov: profile::netbox: Fix user that runs netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/513146 [17:56:06] (03CR) 10jenkins-bot: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:56:18] (03CR) 10CRusnov: [V: 03+2 C: 03+2] profile::netbox: Fix user that runs netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/513146 (owner: 10CRusnov) [17:58:23] (03CR) 10Krinkle: [C: 03+1] MWScript.php: Mark refreshMessageBlobs.php as a global script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513174 
(https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [17:58:57] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1800) [18:01:25] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10colewhite) [18:08:44] 10Operations, 10Wikimedia-Mailing-lists: Mailing list admin pass reset for winedale-l (for migration off lists.wikimedia.org) - https://phabricator.wikimedia.org/T224612 (10brion) [18:12:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: install mysql client binary for cleanup operations [puppet] - 10https://gerrit.wikimedia.org/r/513180 (https://phabricator.wikimedia.org/T224610) [18:14:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/16802/" [puppet] - 10https://gerrit.wikimedia.org/r/513180 (https://phabricator.wikimedia.org/T224610) (owner: 10Arturo Borrero Gonzalez) [18:15:38] (03PS11) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [18:15:41] (03PS10) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [18:27:45] (03CR) 10Muehlenhoff: [C: 03+1] Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [18:35:37] (03PS2) 10BBlack: cache: reimage cp3044 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513143 (https://phabricator.wikimedia.org/T222937) [18:36:37] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - 
https://phabricator.wikimedia.org/T220880 (10Cmjohnson) @elukey I do not have any 4TB disks left over in eqiad. If I understand your comment correctly you are saying it's okay to ignore this for now. [18:37:43] (03CR) 10BBlack: [C: 03+2] cache: reimage cp3044 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513143 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack) [18:37:48] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3044.esams.wmnet [18:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:54] 10Operations, 10ops-eqiad, 10cloud-services-team: cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Cmjohnson) a:03RobH this server is out of warranty. @RobH should we order a new battery? [18:39:08] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3044.esams.wmnet'] ` The log can be found i... [18:47:15] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Legoktm) Please. [18:50:01] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:50:46] hm [18:51:51] (03PS1) 10Urbanecm: Enable abusefilter blocking ability in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513186 (https://phabricator.wikimedia.org/T224617) [18:58:23] (03CR) 10Urbanecm: [C: 03+2] Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) [18:58:46] (03PS8) 10Urbanecm: Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) [18:58:51] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1900) [19:00:44] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10Cmjohnson) Steps i have taken - I took the server down to the bare minimum operating condition 1CPU and 1DIMM and the server will still not boot. I created a support ticket with HP. 5338... [19:02:29] (03CR) 10Ayounsi: [C: 03+2] Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [19:02:39] (03PS11) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [19:03:17] (03CR) 10Jforrester: [C: 04-1] "Dependency needs to go out first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 (owner: 10Jforrester) [19:06:28] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (10Cmjohnson) a ticket has been created with HP for a replacement 5338974144 [19:07:50] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:07:50] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:09:02] !log enable cr1-codfw:et-0/2/0 - T224511 [19:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:08] T224511: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 [19:10:44] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3044.esams.wmnet'] ` The log can be found i... [19:11:32] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Next step is to configure the RPKI validators on one router (eg. cr4-ulsfo): `lang=diff [edit routing-options] + validation { + group rpki { + session 10.64.32.19 { + port 3323; +... [19:31:53] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 3938 MB (2% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:32:34] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[19:32:40] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) >>! In T221507#5219523, @faidon wrote: > - The cr1-eqsin serial change is a bit odd. Netbox used to have a record of wh... [19:32:40] !log phba2001 - reinstalling with stretch - upgrade from jessie (T190568) [19:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:46] T190568: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 [19:32:55] fixing that typo in SAL ..rmm [19:35:00] mutante but it's on twitter now :P [19:35:55] paladox: yea, but still fixing it for later searches in wiki [19:36:10] jouncebot: now [19:36:10] For the next 1 hour(s) and 23 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T1900) [19:36:13] jouncebot: next [19:36:13] In 0 hour(s) and 23 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T2000) [19:37:08] Reedy: It's clear. 
[19:37:09] PROBLEM - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:37:15] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:38:41] PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [19:38:41] PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [19:40:40] ^ that is phab2001 being reinstalled .. did not expect the pybal alerts. 
host itself was downtimed [19:41:08] now it's just waiting for it to be back up, so i'll leave it [19:42:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [19:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:25] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [19:42:49] depooled anyways [19:44:39] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 50.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [19:44:59] now that is not related to what i did [19:45:02] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/Collection/: Replace missing wfCollectionSuggestAction (duration: 01m 01s) [19:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:41] !log enable cr1-codfw:et-0/2/1 - T224511 [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:46] T224511: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 [19:46:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:30] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/Collection/: Replace missing wfCollectionSuggestAction (duration: 00m 57s) [19:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] 10Operations, 10netops: cr1-codfw linecard failure - https://phabricator.wikimedia.org/T224511 (10ayounsi) [19:47:50] 10Operations, 10netops: cr1-codfw linecard failure - 
https://phabricator.wikimedia.org/T224511 (10ayounsi) 05Open→03Resolved Everything seems back to normal. Please reopen if the same issue happen again and we will proceed with a RMA. [19:50:13] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3044.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3044.esams.wmnet'] ` [19:52:34] (03PS24) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [19:53:14] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans) [19:58:28] (03CR) 10CDanis: [C: 03+2] Add a WMF-specific tool for managing db config in MediaWiki (034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T2000). [20:01:13] (03Merged) 10jenkins-bot: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [20:02:09] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Dzahn) >>! In T222308#5154875, @Tgr wrote: > Not sure what counts as consensus for something like this Yea, that's a good question. 
I would say not seeing anyone speak up against it in a... [20:06:54] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10elukey) >>! In T220880#5221945, @Cmjohnson wrote: > @elukey I do not have any 4TB disks left over in eqiad. If I understand your comment correctly you are saying it's okay to ignore this for now. Correct n... [20:08:37] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:08:43] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:10:41] !log pool cp3044 (esams cache_upload ats-be) - T222937 [20:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:47] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [20:11:28] bblack: even though the wmf-reimage failed for some reason? [20:11:39] arrgg.. and just when i said that my own reimage job also failed [20:11:56] modprobe: module ehci-orion not found in modules.dep [20:12:16] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis) a:03Ottomata Andrew, can you (or someone else) advise on rolling out this change for Analytics? I think the minimal viable thing is havin... [20:13:16] mutante: yeah I finished the last bits manual (re-running puppet + reboots, etc) [20:13:46] these are old nodes in esams, and our puppetization for traffic nodes has horrible dependencies/first-boot problems, I don't expect smoothness! 
:) [20:14:48] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10BBlack) The failed reimage was finished up manually (probably not the reimager's fault) [20:15:27] bblack: gotcha! well.. the one i tried is not finding any disks now .. after the reinstall [20:15:31] mdadm: No devices listed in conf file were found. [20:15:55] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) @CDanis we are currently in offsite so this needs to wait until next week :) I'll bring this up tomorrow to my team! [20:18:21] looks like or similar to https://phabricator.wikimedia.org/T149845 [20:19:49] yes. your ticket from 2016 helped me fix it [20:20:23] !log arlolra@deploy1001 Started deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79 [20:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:30] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845 (10Dzahn) I just ran into this when reinstalling phab2001 from jessie to stretch. After installer was done and it rebooted, It fell back to busybox with "mdadm: No devices listed in conf file were found".... 
[20:23:02] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10mobrovac) [20:23:28] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) [20:25:07] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) [20:27:13] (03PS1) 10Reedy: Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 [20:28:09] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79 (duration: 07m 46s) [20:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:14] (03CR) 10jerkins-bot: [V: 04-1] Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [20:29:05] (03CR) 10Jforrester: "Joy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [20:29:12] (03PS1) 10Dzahn: cross-validate-accounts: also check wmde group against admins [puppet] - 10https://gerrit.wikimedia.org/r/513201 [20:29:14] (03PS1) 10Dzahn: install_server: switch phab2001 to stretch installer [puppet] - 10https://gerrit.wikimedia.org/r/513202 (https://phabricator.wikimedia.org/T190568) [20:29:26] (03CR) 10Jforrester: "(Lack of spacing is triggering phpcs.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [20:30:09] (03PS2) 10Reedy: Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 [20:30:31] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Quiddity) +1 to close. 
[20:30:45] (03PS2) 10Dzahn: cross-validate-accounts: also check wmde group against admins [puppet] - 10https://gerrit.wikimedia.org/r/513201 [20:31:20] (03CR) 10jerkins-bot: [V: 04-1] Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [20:31:26] (03PS2) 10Dzahn: install_server: switch phab2001 to stretch installer [puppet] - 10https://gerrit.wikimedia.org/r/513202 (https://phabricator.wikimedia.org/T190568) [20:31:40] (03CR) 10Dzahn: [C: 03+2] install_server: switch phab2001 to stretch installer [puppet] - 10https://gerrit.wikimedia.org/r/513202 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [20:31:49] Reedy: `composer fix` is your friend. [20:32:00] doesn't work in the web browser [20:32:23] (03PS3) 10Reedy: Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 [20:33:02] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab2001.codfw.wmnet'] ` Of which those *... [20:34:34] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[20:35:53] !log Updated Parsoid to 8546c79 (T219927, T211125) [20:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:59] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [20:35:59] T219927: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 [20:40:34] 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10Arlolra) 05Open→03Resolved a:03Arlolra [20:40:37] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Arlolra) [20:40:41] (03PS1) 10Dzahn: introduce miscweb - assign IP for miscweb2001 [dns] - 10https://gerrit.wikimedia.org/r/513205 (https://phabricator.wikimedia.org/T224323) [20:42:57] (03PS2) 10Dzahn: introduce miscweb - assign IP for miscweb2001 [dns] - 10https://gerrit.wikimedia.org/r/513205 (https://phabricator.wikimedia.org/T224323) [20:43:41] RECOVERY - Mjolnir bulk update failure check - codfw on icinga1001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [20:43:59] (03CR) 10Dzahn: [C: 03+2] introduce miscweb - assign IP for miscweb2001 [dns] - 10https://gerrit.wikimedia.org/r/513205 (https://phabricator.wikimedia.org/T224323) (owner: 10Dzahn) [20:44:05] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) [20:48:03] (03CR) 10Bstorm: dologmsg: add -h/--help option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511043 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas 
Werkmeister (WMDE)) [20:54:29] !log creating new ganeti VM miscweb2001.codfw.wmnet with same specs as krypton.eqiad.wmnet (T224323) [20:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:34] T224323: ganeti VM request - miscweb2001 - equivalent of krypton - https://phabricator.wikimedia.org/T224323 [20:59:35] 10Operations, 10SRE-Access-Requests: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10JHedden) [21:00:13] 10Operations, 10SRE-Access-Requests: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10JHedden) [21:08:11] (03CR) 10Jforrester: [C: 03+1] Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [21:22:57] (03PS1) 10Dzahn: netboot: add miscweb[12]00[12] to partman [puppet] - 10https://gerrit.wikimedia.org/r/513215 (https://phabricator.wikimedia.org/T224247) [21:22:59] (03PS1) 10Dzahn: DHCP: add miscweb2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/513216 (https://phabricator.wikimedia.org/T224323) [21:24:09] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 26202 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [21:25:33] RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [21:27:37] (03CR) 10Dzahn: [C: 03+2] netboot: add miscweb[12]00[12] to partman [puppet] - 10https://gerrit.wikimedia.org/r/513215 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [21:27:53] (03CR) 10Dzahn: [C: 03+2] DHCP: add miscweb2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/513216 (https://phabricator.wikimedia.org/T224323) (owner: 10Dzahn) [21:28:12] (03PS2) 10Dzahn: netboot: add miscweb[12]00[12] to partman [puppet] - 10https://gerrit.wikimedia.org/r/513215 (https://phabricator.wikimedia.org/T224247) [21:30:41] (03PS2) 10Dzahn: DHCP: add 
miscweb2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/513216 (https://phabricator.wikimedia.org/T224323) [21:40:09] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab2001.codfw.wmnet'] ` Of which those *... [21:43:23] (03CR) 10jenkins-bot: Stop using array_merge for $wgFlaggedRevsNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512061 (owner: 10Reedy) [21:43:24] (03CR) 10jenkins-bot: mariadb: Depool db2087 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513057 (owner: 10Jcrespo) [21:43:28] (03CR) 10jenkins-bot: Add namespace aliases on zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [21:43:32] (03CR) 10jenkins-bot: Fix Serbian projects' wgRestrictionLevels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [21:43:34] (03CR) 10jenkins-bot: Remove bureaucrat protection level for all Serbian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [21:43:36] (03CR) 10jenkins-bot: Enable AdvancedMobileContributions Overflow menu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509130 (https://phabricator.wikimedia.org/T223883) (owner: 10Nray) [21:43:40] (03CR) 10jenkins-bot: Enable transwiki import between sqwiki and sqwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) (owner: 10Urbanecm) [21:43:42] (03CR) 10jenkins-bot: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [21:43:45] (03CR) 
10jenkins-bot: Remove uploader user group from fawiki and merge it with autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) (owner: 10Urbanecm) [21:44:15] (03CR) 10jenkins-bot: RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [21:44:35] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2037 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513021 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [21:45:03] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513115 (owner: 10Zfilipin) [21:45:26] (03CR) 10jenkins-bot: Revert "Hardcode korean help desk config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512942 (owner: 10Sbisson) [21:47:16] !log sign puppet cert request for phab2001 after reinstall (for some reason it needed me to connect to console and hit enter, reimage script itself was stuck) [21:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:48] (03CR) 10jenkins-bot: Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) [21:47:57] !log installing OS on miscweb2001 VM failed at grub install step :( T224323 [21:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:02] T224323: ganeti VM request - miscweb2001 - equivalent of krypton - https://phabricator.wikimedia.org/T224323 [21:49:55] (03CR) 10jenkins-bot: Set wgLocaltimezone for euwiki to Europe/Berlin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511849 (https://phabricator.wikimedia.org/T224091) (owner: 10Urbanecm) [21:50:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: 
name=phab2001-vcs.codfw.wmnet [21:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:45] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513139 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [21:51:09] (03CR) 10jenkins-bot: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) (owner: 10Urbanecm) [22:05:40] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Aklapper) > Not sure what counts as consensus for something like this Maybe also send a heads-up to the list itself and link to this task (to not fragment discussion in this task and on t... [22:13:16] (03CR) 10EBernhardson: Introduce profile::analytics::search::data_drop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [22:17:45] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[22:23:36] (03PS1) 10Dzahn: cumin: add alias for new miscweb* cluster name [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) [22:24:20] (03PS2) 10Dzahn: cumin: add alias for new miscweb* cluster name [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) [22:27:23] (03CR) 10Volans: cumin: add alias for new miscweb* cluster name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:28:51] (03PS1) 10Dzahn: add miscweb2001 to role webserver_misc_apps in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/513230 (https://phabricator.wikimedia.org/T224247) [22:30:12] (03CR) 10Dzahn: cumin: add alias for new miscweb* cluster name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:35:37] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:36:37] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:36:55] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:37:54] (03PS1) 10Jforrester: CirrusSearch-common: Define wgCirrusSearchWeights and wgCirrusSearchNamespaceWeights locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513232 (https://phabricator.wikimedia.org/T224634) [22:38:14] (03CR) 10Dzahn: [C: 03+2] cumin: add alias for new miscweb* cluster name [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:39:18] jouncebot: now [22:39:18] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [22:39:36] (03CR) 
10Jforrester: [C: 03+2] CirrusSearch-common: Define wgCirrusSearchWeights and wgCirrusSearchNamespaceWeights locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513232 (https://phabricator.wikimedia.org/T224634) (owner: 10Jforrester) [22:39:39] (03CR) 10Dzahn: cumin: add alias for new miscweb* cluster name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513227 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:39:48] (03PS2) 10Dzahn: add miscweb2001 to role webserver_misc_apps in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/513230 (https://phabricator.wikimedia.org/T224247) [22:40:40] (03Merged) 10jenkins-bot: CirrusSearch-common: Define wgCirrusSearchWeights and wgCirrusSearchNamespaceWeights locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513232 (https://phabricator.wikimedia.org/T224634) (owner: 10Jforrester) [22:40:42] (03CR) 10Dzahn: [C: 03+2] add miscweb2001 to role webserver_misc_apps in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/513230 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:40:55] (03CR) 10jenkins-bot: CirrusSearch-common: Define wgCirrusSearchWeights and wgCirrusSearchNamespaceWeights locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513232 (https://phabricator.wikimedia.org/T224634) (owner: 10Jforrester) [22:41:42] mutante: actually, at this point was easier to use the alias inside the misc-apache one... 
anyway, same result [22:43:24] yea, it's the same one, by class, and alright [22:46:12] (03CR) 10Ayounsi: "> Patch Set 23:" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [22:46:35] !log jforrester@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: Hot-deploy T224634 to fix CirrusSearch for extension registration (duration: 00m 57s) [22:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:41] T224634: beta-scap-eqiad failing – "Call to mwscript eval.php returned: None" - https://phabricator.wikimedia.org/T224634 [22:47:16] (03PS24) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [22:47:52] (03CR) 10jerkins-bot: [V: 04-1] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [22:50:59] !log miscweb2001 - when first trying to git pull iegreview - still tries to resolve 'tin.eqiad.wmnet' which is long gone. fix is still to manually edit /srv/deployment/iegreview/iegreview-cache/cache/.git/config [22:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:15] PROBLEM - puppet last run on miscweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[ensure_present_mod_php7.0] [22:52:51] !log miscweb2001 - a2dismod mpm_event ; systemctl restart apache2 to fix php7.0 dependency issue [22:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:22] (03PS25) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [22:57:35] RECOVERY - puppet last run on miscweb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:00:04] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. 
Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190529T2300). [23:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:10] For whomever does SWAT - late addition: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/512949/ [23:01:47] !log phab2001 - same issue with tin.eqiad.wmnet still showing up when first trying to git clone [23:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:56] here [23:02:42] (03CR) 10Jforrester: [C: 03+2] Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:02:44] I'll SWAT. [23:02:53] (03CR) 10jerkins-bot: [V: 04-1] Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:03:42] oops rebase needed [23:03:48] 1 min [23:04:49] (03PS2) 10Jforrester: Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:04:56] mutante https://github.com/wikimedia/scap/search?utf8=✓&q=tin.eqiad.wmnet&type= [23:04:59] (03CR) 10Jforrester: [C: 03+2] Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:05:05] SMalyshev: Already landed. [23:05:40] ah, thanks! 
[23:06:03] (03Merged) 10jenkins-bot: Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:06:26] paladox: yea.. but i could swear we went through this before and i had once a patch for that scap [23:06:32] oh [23:06:58] (03CR) 10jenkins-bot: Enable wgSpecialSearchFormOptions on production Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512989 (https://phabricator.wikimedia.org/T55652) (owner: 10Smalyshev) [23:07:09] SMalyshev: Live on mwdebug1002. [23:07:56] checking [23:08:33] James_F: works! [23:08:56] Cool. [23:10:15] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Enable wgSpecialSearchFormOptions on production Wikidata T55652 (duration: 00m 57s) [23:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:21] T55652: Special:Search doesn't use labels and descriptions for suggestions but just the item ID - https://phabricator.wikimedia.org/T55652 [23:13:03] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:13:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [23:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:07] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:14:12] * James_F twiddles thumbs, waiting on zuul. 
[23:14:49] RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:15:00] (03CR) 10Jforrester: [C: 03+2] Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [23:15:09] RECOVERY - PyBal backends health check on lvs2005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:15:13] (03CR) 10Jforrester: [C: 03+2] MWScript.php: Mark refreshMessageBlobs.php as a global script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513174 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [23:15:27] !log repooled phab2001-vcs , fixes pybal / lvs alerts [23:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:51] (03CR) 10Jforrester: [C: 03+2] build: on CI only lint changed files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491564 (owner: 10Hashar) [23:16:01] (03Merged) 10jenkins-bot: Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [23:16:13] (03Merged) 10jenkins-bot: MWScript.php: Mark refreshMessageBlobs.php as a global script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513174 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [23:16:35] RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:16:47] (03CR) 10jenkins-bot: Replace FR constants with numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513200 (owner: 10Reedy) [23:16:54] (03Merged) 10jenkins-bot: build: on CI only lint changed files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491564 (owner: 10Hashar) [23:17:39] !log jforrester@deploy1001 Synchronized multiversion/MWScript.php: Mark refreshMessageBlobs.php as a global script (duration: 00m 56s) [23:17:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:01] !log jforrester@deploy1001 Synchronized wmf-config/flaggedrevs.php: Replace FR constants with numbers Ia52f644948 (duration: 00m 56s) [23:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:11] RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:20:33] (03PS4) 10Dzahn: phabricator: enable php-fpm in Hiera on both hosts [puppet] - 10https://gerrit.wikimedia.org/r/510597 (https://phabricator.wikimedia.org/T190568) [23:20:44] (03PS4) 10Jforrester: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [23:21:13] (03CR) 10Jforrester: "Ping. Is this good to land?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [23:22:26] (03PS5) 10Dzahn: phabricator: enable php-fpm in Hiera on both hosts [puppet] - 10https://gerrit.wikimedia.org/r/510597 (https://phabricator.wikimedia.org/T190568) [23:23:22] Krinkle: Worth testing or should I just sling out? It's on mwdebug1002. [23:23:29] checking [23:24:40] James_F: all clear [23:25:24] (03PS1) 10Dzahn: phabricator: remove role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513241 [23:26:23] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/AbuseFilter/includes/parser/AbuseFilterTokenizer.php: SWAT AbuseFilter: Tokenizer caching back to APC I8c6a4a95e (duration: 00m 54s) [23:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:50] All done. [23:26:55] Last call for SWAT? 
[23:28:46] Looks like https://github.com/wikimedia/mediawiki-extensions-Collection/commit/0f163c9cbecd3ebb0631 didn't make it into wmf.6 [23:29:03] (03PS3) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [23:29:11] but I can suppress it for another day. And it seems we're not allowed to spend time on it budget-wise, so.. [23:29:31] It's an easy thing to back-port if you want? [23:29:41] Yeah, but.. effort. [23:29:48] PHP Fatal Error from line 18 of /srv/mediawiki/php-1.34.0-wmf.6/extensions/Collection/includes/CollectionAjaxFunctions.php: Class undefined: SessionManager [23:29:49] Eh. It's already 16:30. Let's call a lid. [23:29:54] SWAT DONE. [23:29:58] :) [23:30:03] * Krinkle signs off [23:30:09] 00:30 here [23:30:28] thx James_F [23:30:29] utc ? [23:30:40] I have my own timezone. [23:30:47] o/ [23:31:31] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) [23:31:55] (03CR) 10Nray: "> Ping. Is this good to land?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [23:33:28] (03CR) 10Jforrester: [C: 03+2] Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [23:33:35] * James_F coughs. 
[23:34:11] 10Operations, 10serviceops, 10vm-requests: ganeti VM request - miscweb2001 - equivalent of krypton - https://phabricator.wikimedia.org/T224323 (10Dzahn) 05Open→03Resolved https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=miscweb2001 [23:34:14] 10Operations, 10serviceops, 10Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [23:34:22] (03Merged) 10jenkins-bot: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [23:34:26] Don't mind me. [23:34:36] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [23:35:02] 10Operations: Migrate ununpentium/RT to Stretch/Buster - https://phabricator.wikimedia.org/T224575 (10Dzahn) a:03Dzahn [23:35:43] 10Operations: Migrate ununpentium/RT to Stretch/Buster - https://phabricator.wikimedia.org/T224575 (10Dzahn) basically a duplicate of T180641 [23:35:48] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Remove wikibase sameAs A/B test config, part I (duration: 00m 56s) [23:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:10] !log jforrester@deploy1001 sync-file aborted: Remove wikibase sameAs A/B test config, part I (duration: 00m 00s) [23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:40] Important to get the log message right. 
;-) [23:37:10] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove wikibase sameAs A/B test config, part II (duration: 00m 56s) [23:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:56] (03PS1) 10Dzahn: phabricator: activate logmail on phab1003, disable on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513242 (https://phabricator.wikimedia.org/T190568) [23:41:24] (03PS1) 10CRusnov: profile::netbox: Tweaking report alerts [puppet] - 10https://gerrit.wikimedia.org/r/513243 [23:45:24] (03PS2) 10Dzahn: phabricator: activate logmail on phab1003, disable on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513242 (https://phabricator.wikimedia.org/T190568) [23:46:34] (03CR) 10Dzahn: [C: 03+1] profile::netbox: Tweaking report alerts [puppet] - 10https://gerrit.wikimedia.org/r/513243 (owner: 10CRusnov) [23:50:07] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Tweaking report alerts [puppet] - 10https://gerrit.wikimedia.org/r/513243 (owner: 10CRusnov) [23:53:21] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) Sorry for delay - was working on finalizing signup process. Yes, we would like to preserve the old email address if possible. Not as concerned about th... 
[23:53:46] (03PS3) 10Dzahn: phabricator: activate logmail on phab1003, disable on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513242 (https://phabricator.wikimedia.org/T190568) [23:54:23] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16803/" [puppet] - 10https://gerrit.wikimedia.org/r/513242 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [23:54:32] (03PS4) 10Dzahn: phabricator: activate logmail on phab1003, disable on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513242 (https://phabricator.wikimedia.org/T190568) [23:55:59] PROBLEM - Check the Netbox report-s- -puppetdb- for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:56:35] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) 05Stalled→03Open