[00:01:28] !log wikitech-static changing certbot renewalparams: authenticator = webroot (changed from standalone), install = apache (unchanged) (T214640) [00:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:36] T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 [00:01:55] !log wikitech-static certbot --dry-run renew (T214640) [00:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:12] !log wikitech-static - adding (undocumented!) option webroot-map to certbot config to use webroot authenticator with different document roots per domain while using the config file and not cli params (T214640) [00:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:21] T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 [00:24:31] Urbanecm: rabbit hole.. webroot means having to specify doc roots (per domain), they are different for the 2 domains. not even documented how to do that in config file instead of with cli params. then found others asking it in their forums and there is "webroot-map" but if i use that like others marked it as solved.. it's "failed to parse config file". then found the NEW config syntax which [00:24:37] is different.. then exception.. also nice is there is already a post hook but " such file or directory: 'service apache2 start' [00:25:32] wait, no such file or directory? [00:25:49] yea, but i mean.. if the authenticator if webroot.. we dont even want to stop and start it [00:25:52] is [00:26:01] true [00:26:02] correct, "service ..." 
is "no such file" [00:26:13] just reload at the end [00:26:45] also fun is if the config file has "installer = apache" and if you dry-run you are told "installer = none" [00:26:55] :) [00:28:20] FileNotFoundError: [Errno 2] No such file or directory: '/var/www/status/.well-known/acme-challenge/ [00:28:34] mutante, i use https://paste.ee/p/rLxLq on wmcz prod [00:28:37] wikitech-static.wikimedia.org.conf produced an unexpected error: [00:28:42] not sure what's the one you used [00:29:01] ^ but i am telling it that the webroot for wikitech-static is NOT the same as the one for status.wm.org [00:29:15] then just use different paths? [00:29:32] you have the new webroot_map syntax that i also use now [00:30:26] the old one was webroot-map = {"domain.com,www.domain.com":"/srv/www/customer/domain.com/www", "beta.domain.com":"/srv/www/customer/domain.com/beta"} [00:30:35] i see [00:30:49] ad the filenotfounderror thing [00:30:57] does /var/www/status exist? [00:31:04] yes, it does exist [00:31:08] if so, does /var/www/status/.well-known/acme-challenge? [00:31:14] no, it does not [00:31:32] could you try mkdir -p /var/www/status/.well-known/acme-challenge? [00:31:33] wait. it does NOW [00:31:46] wait what? [00:31:50] it got created a few minutes ago [00:31:52] and it's empty [00:31:58] what does the dry run do? [00:32:18] "Cert not due for renewal, but simulating renewal for dry run [00:32:27] "Cleaning up challenges [00:32:43] doesn't look like anything bad so far [00:34:04] to be precise.. the no such file or directory is for a file INSIDE that acme-challenge dir [00:34:13] of course that is the challenge to find that [00:34:30] but it's not getting created [00:35:28] well, what user does certbot run as? [00:35:49] (if not root, does the user has write permissions to acme-challenge?) [00:36:27] also not sure if --dry-run actually talks to the acme api [00:36:58] if not, --force-renewal should enable you to renew before its due for renewal [00:37:31] it's root. 
the .well-known dir has just been created by dry run and it's root owned [00:37:59] as long as force doesnt mean i end up with the existing cert revoked and new ones not being issued :p [00:38:26] god knows that [00:39:05] it's not the definition of --force-renewal, but you know... [00:39:15] ...things don't always do what docs says they do :p [00:39:39] another attempt could be to have 2 certs for 2 domains with 2 config files [00:40:05] that's exactly what wmcz does [00:40:26] this config here is doing one cert with an altname [00:40:40] and it's named after wikitech-static [00:40:43] i see [00:41:11] another solution is to just drop status.wm.o for good [00:41:18] it doesn't do anything anymore iirc [00:41:26] lol, indeed. thought "who uses that page anyways" just now [00:41:50] and for not being used "status.wm.org down" sounds way too critical :) [00:42:05] i see [00:42:20] two configs sounds like better solution [00:42:22] at least for now [00:42:42] in next 100 years, i'd consider dropping status.wm.o :D [00:43:15] well.. kind of [00:43:24] the real fix should be to replace it with a new status page [00:43:30] that shows ..status [00:43:48] maybe that gets us back to "reopen icinga to the public" :p [00:44:03] why was it closed btw? [00:44:12] some security issue years ago [00:44:25] not sure which one it was though [00:44:39] but it made us add simple auth back then [00:44:49] i see [00:44:56] icinga.wm.o looks old btw [00:45:08] s/old/stable/ :p [00:45:42] that's the same, according to some project's policies [00:45:44] 1.x is still in buster [00:46:00] so for now it's still ok [00:46:05] ok [00:46:53] but yes, at one point it will be a question of using icinga 2.x or a completely different solution for alerting [00:47:16] yeah [00:47:31] back to the prev topic, did you try --force-renewal, or decided to keep that for later? 
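Editorial note on the webroot discussion above: the sketch below checks, for each domain, that `<webroot>/.well-known/acme-challenge` exists and that the current user can create a file in it. It mirrors the `mkdir -p` and "what user does certbot run as?" checks suggested in the chat. It is only a pre-flight check under stated assumptions, not certbot's own logic; `check_webroots` and the example domain names are hypothetical, and only `/var/www/status` appears in the log.

```python
import os
import tempfile
from pathlib import Path


def check_webroots(webroot_map):
    """Return {domain: None} on success or {domain: error message} on failure.

    webroot_map is a hypothetical {domain: document_root} mapping, e.g. one
    document root per domain as discussed in the chat.
    """
    results = {}
    for domain, webroot in webroot_map.items():
        challenge_dir = Path(webroot) / ".well-known" / "acme-challenge"
        try:
            # Equivalent of `mkdir -p <webroot>/.well-known/acme-challenge`.
            challenge_dir.mkdir(parents=True, exist_ok=True)
            # Create and remove a probe file to confirm write permission.
            fd, probe = tempfile.mkstemp(dir=challenge_dir)
            os.close(fd)
            os.remove(probe)
            results[domain] = None
        except OSError as exc:
            results[domain] = str(exc)
    return results
```

A call such as `check_webroots({"status.example": "/var/www/status"})` (domain name assumed for illustration) would report whether the challenge directory is usable before attempting a renewal.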
[00:48:02] i decided to stop here and continue it in the morning because i feel tired and kind of rushed because the co-working space closes [00:48:14] not currently broken but potential to mess it up more [00:48:43] i see [00:48:48] doesnt expire until September [00:49:04] great [00:49:34] well, i'm about to go to bed then, i'm in eu [00:50:07] !log wikitech-static commented out cert renewal cron job out of caution - still needs fixing but continue tomorrow [00:50:13] Urbanecm: thanks and good night then [00:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:36] yw, always happy to help :) [02:27:25] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [02:58:39] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/Permissions/PermissionManager.php: (no justification provided) (duration: 00m 57s) [02:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:15] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [03:00:37] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.13/includes/Permissions/PermissionManager.php: (no justification provided) (duration: 00m 54s) [03:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:05] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [03:13:33] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [03:42:53] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.13/extensions/CentralAuth/includes/specials/SpecialMultiLock.php: T227772 (duration: 00m 56s) [03:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:00] T227772: Fix or remove capability to override user rights for the current request - https://phabricator.wikimedia.org/T227772 [03:46:16] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CentralAuth/includes/specials/SpecialMultiLock.php: T227772 (duration: 00m 54s) [03:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:01] (PS1) Marostegui: db1065: Prepare decommissioning [puppet] - https://gerrit.wikimedia.org/r/523849 (https://phabricator.wikimedia.org/T227560) [05:23:48] (CR) Marostegui: [C: +2] db1065: Prepare decommissioning [puppet] - https://gerrit.wikimedia.org/r/523849 (https://phabricator.wikimedia.org/T227560) (owner: Marostegui) [05:24:56] !log Remove db1065 from tendril and zarcillo - T227560 [05:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:05] T227560: decommission db1065 - https://phabricator.wikimedia.org/T227560 [05:26:34] !log Stop MySQL on db1065 for decommissioning - T227560 [05:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:12] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui) [05:35:14] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Papaul) [05:39:11] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [05:43:20] Operations, Traffic: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP 
Headers - https://phabricator.wikimedia.org/T228135 (Vgutierrez) Implement logging of SSL Elliptic Curve used: https://github.com/apache/trafficserver/pull/5724 has been already merged into master. The API... [05:43:39] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [05:58:42] Operations, ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (elukey) @wiki_willy I'll try to disable this alarm for good, the host does not use the disk and there is no real reason to waste a spare :) [05:59:20] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui) [06:01:03] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui) p:Triage→Normal [06:20:42] !log sudo -i /usr/local/sbin/restart-php7.2-fpm on mwdebug* to reset opcache [06:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:12] !log reboot analytics1072 as attempt to clear the megacli's config (and add a new disk) [06:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:54] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:30:36] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:39:12] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:39:13] (PS1) Elukey: Remove host specific hiera settings for analytics1072 [puppet] - https://gerrit.wikimedia.org/r/523851 (https://phabricator.wikimedia.org/T226467) [06:40:13] (CR) Elukey: [C: +2] Remove host specific hiera settings for analytics1072 [puppet] - https://gerrit.wikimedia.org/r/523851 (https://phabricator.wikimedia.org/T226467) (owner: Elukey) [06:43:40] Operations, DBA, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (Marostegui) Great work, a lot less files to edit when provisioning/moving/decommissioning hosts which were very error prone! Thanks :) [06:43:48] Operations, MediaWiki-Debug-Logger, Release-Engineering-Team-TODO, Wikimedia-Logstash: Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455 (greg) [06:44:26] Operations, ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (wiki_willy) Thanks @elukey , much appreciated! ~Willy [06:44:34] Operations, ops-eqiad, Analytics, Patch-For-Review: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (elukey) Open→Resolved @Cmjohnson thanks a lot! I had to reboot again to be able to configure the new PD, not really sure why (the megacli commands were failing bef... [06:46:25] Puppet, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), User-greg: Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607 (greg) Open→Resolved a:greg >>! 
In T143607#3413032, @EBernhardson wrote: > mwrepl has a 'bypa... [06:50:28] (PS1) Muehlenhoff: Switch ORES pool counters for eqiad to 1003/1004 [puppet] - https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) [06:51:26] (PS1) Elukey: Add mw2224 to the list of hosts with async replication in mcrouter [puppet] - https://gerrit.wikimedia.org/r/523855 (https://phabricator.wikimedia.org/T225642) [06:55:30] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:57:44] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:58:19] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17423/" [puppet] - https://gerrit.wikimedia.org/r/523855 (https://phabricator.wikimedia.org/T225642) (owner: Elukey) [06:59:04] !log apply mcrouter async replication to mw2224 - T225642 [06:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:19] T225642: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 [06:59:48] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (Legoktm) During the initial PHP 7 preparation (when that puppet file was written), I did an... 
[07:00:22] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [07:01:28] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar), User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (elukey) @aaron mw2224 ready for testing :) [07:02:02] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (MoritzMuehlenhoff) On the Debian packaging level there are also no reverse depencies on php-... [07:09:57] (PS1) Muehlenhoff: Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857 [07:13:09] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (MoritzMuehlenhoff) Graphoid is based on NodeJS, so it should be migrated to Node 10 (and thus Stretch) ei... [07:13:18] (PS2) Muehlenhoff: Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857 [07:15:28] (CR) Muehlenhoff: [C: +2] Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857 (owner: Muehlenhoff) [07:26:59] (PS3) Ema: 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668) [07:33:38] !log reimaging sarin for some tests [07:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:40] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [07:45:17] (PS2) Jcrespo: mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) [07:45:32] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [07:45:37] (CR) Jcrespo: "Please review and confirm." [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: Jcrespo) [07:46:19] !log cp-esams: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672 [07:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:26] T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672 [07:48:22] !log swift eqiad-prod: put back ms-be1043 sdk1 - T218544 [07:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:29] T218544: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 [07:50:28] Operations, ops-eqiad, Operations-Software-Development, observability, User-fgiunchedi: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (fgiunchedi) [07:51:50] (PS1) DCausse: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) [07:54:37] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) @Marostegui Double checking, should we replace this or is it being decommed now? 
[07:56:22] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Marostegui) Let's replace with an USED one for now, that host will go away "soonish" [07:59:54] (PS2) Filippo Giunchedi: Remove lithium from service [puppet] - https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) [08:00:55] Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) [08:01:48] (CR) Filippo Giunchedi: [C: +2] Remove lithium from service [puppet] - https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi) [08:02:24] Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (MoritzMuehlenhoff) Ack, this looks good to me! [08:02:54] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) Also followed up on the codfw task, but adding here for completeness as well: This looks good to me! 
[08:03:21] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CRITICAL - load average: 186.88, 119.76, 55.91 https://wikitech.wikimedia.org/wiki/Swift [08:03:35] PROBLEM - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:03:36] ACKNOWLEDGEMENT - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228245 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:03:40] Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (ops-monitoring-bot) [08:05:19] Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (elukey) a:elukey→RobH [08:05:28] Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) After discussing with @Pchelolo, we believe that in order to migrate the rest, we could migrate ~25% of job... 
[08:05:43] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) a:elukey→RobH [08:06:25] PROBLEM - Disk space on ms-be2019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops [08:07:00] I'll take a look at 2019 shortly [08:08:36] (PS4) Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts [puppet] - https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) [08:09:32] (CR) Filippo Giunchedi: [C: +2] prometheus: add kafka logging consumer lag alerts [puppet] - https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) (owner: Filippo Giunchedi) [08:10:19] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 17.16, 65.37, 56.68 https://wikitech.wikimedia.org/wiki/Swift [08:10:38] ACKNOWLEDGEMENT - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228246 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:10:41] Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228246 (ops-monitoring-bot) [08:12:15] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [08:15:30] (PS2) Effie Mouzeli: jobrunners: Test php7_only on 6 jobrunners [puppet] - https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) [08:16:50] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) There is no spare USED disks. [08:16:59] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [08:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:49] T227867: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 [08:20:56] (CR) Filippo Giunchedi: [C: +1] "LGTM, I did the same test on cp2026 and seems to work as expected." [puppet] - https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: Jbond) [08:21:41] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff) [08:24:55] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1003. [puppet] - https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) [08:24:57] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) [08:24:59] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2005. [puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) [08:25:01] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122) [08:25:03] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1007. 
[puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122) [08:25:05] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1008. [puppet] - https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122) [08:25:07] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2002. [puppet] - https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122) [08:25:09] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2003. [puppet] - https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122) [08:25:11] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1005. [puppet] - https://gerrit.wikimedia.org/r/523874 (https://phabricator.wikimedia.org/T228122) [08:25:13] (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1006. [puppet] - https://gerrit.wikimedia.org/r/523875 (https://phabricator.wikimedia.org/T228122) [08:27:09] (CR) Filippo Giunchedi: prometheus: wire up prometheus-varnishkafka-exporter for deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [08:27:58] (PS4) Filippo Giunchedi: Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703) [08:28:44] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1003. [puppet] - https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:29:05] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:29:24] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2005. 
[puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:29:46] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:29:49] (CR) Filippo Giunchedi: [C: +2] Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi) [08:30:02] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1007. [puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:30:25] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1008. [puppet] - https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:30:51] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2002. [puppet] - https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:31:32] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1005. [puppet] - https://gerrit.wikimedia.org/r/523874 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:31:55] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1006. [puppet] - https://gerrit.wikimedia.org/r/523875 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:32:54] (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2003. 
[puppet] - https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122) (owner: Gehel) [08:34:23] (PS1) Muehlenhoff: maps: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/523876 [08:34:32] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/523624 (owner: Muehlenhoff) [08:34:51] (CR) Gehel: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: DCausse) [08:35:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:36:14] (PS7) Ema: ATS: add atsbackend.mtail [puppet] - https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [08:36:37] !log Disable puppet on thumbor* in eqiad, depool and pool back to apply 523728 - T224572 [08:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:45] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [08:38:26] (CR) Effie Mouzeli: [C: +2] Switch Thumbor pool counters in eqiad to poolcounter1004 [puppet] - https://gerrit.wikimedia.org/r/523728 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff) [08:38:36] (PS2) Effie Mouzeli: Switch Thumbor pool counters in eqiad to poolcounter1004 [puppet] - https://gerrit.wikimedia.org/r/523728 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff) [08:38:46] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Marostegui) We should have a bunch of disks from the decommissioned hosts, no? 
[08:39:28] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui) Window reserved on the deployments page: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1832674&oldid=1832612 Em... [08:40:31] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:46:22] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (TheDJ) Note that getimagesize and getimagesizefromstring are [[ https://github.com/php/php-s... [08:47:32] (PS1) Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548) [08:47:34] (PS1) Vgutierrez: lvs: Enable paging for ncredir checks [puppet] - https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548) [08:51:20] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:51:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:02:41] (CR) Ema: [C: +2] 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668) (owner: Ema) [09:07:43] !log upload fifo-log-demux 0.3 to 
stretch-wikimedia T227668 [09:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:52] T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 [09:09:04] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (10Peachey88) [09:09:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228246 (10Peachey88) [09:11:09] (03PS1) 10Marostegui: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) [09:11:59] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs1003. [puppet] - 10https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) [09:13:24] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs1003. [puppet] - 10https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [09:13:38] (03PS1) 10Ema: ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668) [09:14:34] (03PS2) 10Ema: ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668) [09:15:05] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [09:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:05] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:16:05] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [09:16:13] (03PS2) 10Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548) 
[09:16:17] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) p:05Triage→03Normal [09:16:46] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [09:16:51] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10Marostegui) [09:17:08] (03CR) 10Ema: [C: 03+2] ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [09:17:43] damn... I got puppet snipped xD [09:17:45] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [09:17:47] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [09:17:57] (03PS3) 10Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548) [09:18:50] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui) [09:19:15] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:19:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:53] (03Merged) 10jenkins-bot: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui) [09:20:10] (03CR) 10jenkins-bot: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui) 
[09:21:15] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool and clarify db2045 status T227862 (duration: 00m 55s) [09:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:22] T227862: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 [09:21:43] !log cp-ats: upgrade fifo-log-demux to 0.3 T227668 [09:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:49] T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 [09:22:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10Marostegui) No point on spending time with this old host, I will start its decommissioning process. [09:22:26] (03CR) 10Vgutierrez: [C: 03+2] lvs: Enable paging for ncredir checks [puppet] - 10https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:22:35] (03PS2) 10Vgutierrez: lvs: Enable paging for ncredir checks [puppet] - 10https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548) [09:23:44] !log rebooting grafana1001 to pick up MDS-enabled qemu [09:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] (03PS8) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [09:25:29] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Reedy) Mounting it where though? 
[09:28:50] (03CR) 10Filippo Giunchedi: prometheus: wire up prometheus-varnishkafka-exporter for deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [09:33:26] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [09:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:57] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:59] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [09:39:20] (03CR) 10Jbond: [C: 04-1] "The previous nits can be ignored as this is not going to be around long. however there is a bug in the change to lookup vs hiera" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [09:39:54] (03PS4) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff) [09:40:06] (03PS5) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff) [09:40:42] (03CR) 10Ema: [C: 03+1] "Tried with mtail 3.0.0~rc5-1~bpo9+1wmf1 and confirmed that stats do get incremented as expected." [puppet] - 10https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: 10Jbond) [09:40:50] (03CR) 10Jcrespo: [C: 03+2] Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff) [09:41:12] ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Muehlenhoff T223450 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [09:43:00] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [09:47:03] (03PS3) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) [09:47:10] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1003.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201907170947_j... [09:49:29] vgutierrez: puppet runs are failing on icinga1001, seems to be caused by your set_notes comment for ncredir: [09:49:30] Error while evaluating a Function Call, The $dashboard_links and $notes_links URLs must not be URL-encoded at /etc/puppet/modules/monitoring/functions/build_notes_url.pp:18:13 at /etc/puppet/modules/profile/manifests/prometheus/alerts.pp:194 on node icinga1001.wikimedia.org [09:49:44] uh? [09:49:45] commit, not comment [09:50:30] the notes_url is 'https://wikitech.wikimedia.org/wiki/Ncredir' [09:50:38] how's that URL encoded? [09:50:43] (03PS9) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [09:51:24] I have no idea, only saw the alert in our Icinga :-) [09:51:43] yeah, thanks for pinging me [09:51:54] but I dunno what's going on here TBH [09:52:28] hmmm from build_notes_url.pp [09:52:31] # The notes link always has to come first to ensure the correct icon is used in icinga [09:52:31] # we start with `[]` so puppet knows we want a array [09:52:31] $links = [] + $notes_link + $dashboard_links [09:52:35] fixing.... 
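(Editorial sketch, not part of the channel traffic.) The Puppet error quoted above rejects notes and dashboard links that are URL-encoded. The log does not show how build_notes_url.pp performs that check, so the following is only a minimal illustration, assuming the check amounts to looking for percent-encoded bytes such as `%23`. The names `is_url_encoded` and `find_encoded_links` are hypothetical, and the notes link is treated as a single string.

```python
import re

# Assumption: "URL-encoded" means the link contains a percent-encoded byte
# such as %23 or %2F. The real Puppet function may test something different.
PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def is_url_encoded(url: str) -> bool:
    """Return True if the URL contains at least one percent-encoded byte."""
    return PERCENT_ENCODED.search(url) is not None


def find_encoded_links(notes_link: str, dashboard_links: list) -> list:
    """Return the links that would trip the check (hypothetical helper).

    The notes link is placed first, mirroring the ordering comment quoted
    from build_notes_url.pp in the log.
    """
    links = [notes_link] + list(dashboard_links)
    return [link for link in links if is_url_encoded(link)]


if __name__ == "__main__":
    notes = "https://wikitech.wikimedia.org/wiki/Ncredir"
    dashboards = ["https://wikitech.wikimedia.org/wiki/Netbox%23Reports"]
    # Only the second link contains a percent-encoded byte.
    print(find_encoded_links(notes, dashboards))
```

Under that assumption the Ncredir notes_url passes, which is consistent with the later finding in the channel that the offending link was a dashboard link elsewhere in the prometheus alerts profile.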
[09:53:36] (03PS1) 10Vgutierrez: ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) [09:53:53] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:54:01] wonderful [09:54:14] 10Operations, 10DBA, 10Jade, 10Patch-For-Review, and 2 others: Review Jade data storage and architecture proposal [RFC] - https://phabricator.wikimedia.org/T200297 (10awight) Congratulations, looking forward to seeing this deployed! [09:54:23] ah.. rebasing issues :) [09:54:24] (03PS2) 10Vgutierrez: ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) [09:55:51] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:58:07] (03PS6) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff) [09:59:59] nope.. that wasn't the issue :/ [10:00:36] (03PS1) 10Filippo Giunchedi: varnish: remove varnishreqstats-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942) [10:00:38] (03PS1) 10Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942) [10:00:44] vgutierrez: I'll take a look at this, I'm familiar with the notes_url stuff and think it is unrelated to your change [10:00:50] (03PS1) 10Vgutierrez: Revert "ncredir: Fix notes_url" [puppet] - 10https://gerrit.wikimedia.org/r/523893 [10:01:06] jbond42: yeah, it looks right on the first one [10:01:11] I'm reverting my last commit [10:01:16] jbond42: thanks for figuring out the -logs /dev/stdin thing! 
<3 [10:01:43] ema: np, was suggested by the upstream dev [10:01:44] (03CR) 10Vgutierrez: [C: 03+2] Revert "ncredir: Fix notes_url" [puppet] - 10https://gerrit.wikimedia.org/r/523893 (owner: 10Vgutierrez) [10:01:56] vgutierrez: yes the first one is fine [10:02:33] all yours then :) [10:03:13] thanks :) [10:04:17] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:04:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:08:19] !log rebooting lithium for kernel update [10:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:57] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:18:29] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:18:30] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:14] (03PS4) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) [10:19:21] (03CR) 10Ema: [C: 03+2] ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:19:38] (03PS10) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [10:19:52] (03CR) 10Ema: [C: 03+2] ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:20:51] !log disabled icinga1001 in meta monitoring [10:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:14] (03PS1) 10Jbond: Icinga: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897 [10:21:50] jbond42: this one looks the offender BTW: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/alerts.pp#L202 [10:22:07] vgutierrez: lol see the patch i just sent above [10:22:07] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1003.eqiad.wmnet'] ` and were **ALL** 
successful. [10:22:14] ahaha right [10:22:22] it should say prometheus: in the commit message right? [10:22:31] I mean, it's a change on the prometheus profile [10:22:44] (03PS2) 10Jbond: Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897 [10:22:47] yep fixed [10:22:52] <3 thx [10:23:08] (03PS3) 10Jbond: Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897 [10:23:17] !log rebooting icinga1001 for kernel update [10:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:58] (03CR) 10Jbond: [C: 03+2] Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897 (owner: 10Jbond) [10:30:45] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [10:30:49] !log start rolling reboot of ms-be eqiad hosts - T225713 [10:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [10:33:03] PROBLEM - High lag on wdqs1003 is CRITICAL: 4383 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:34:13] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [10:36:53] (03PS2) 10Alexandros Kosiaris: Switch ORES pool counters for eqiad to 1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [10:36:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch ORES pool counters for eqiad to 1003/1004 [puppet] - 
10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [10:37:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, thanks. Merging!" [puppet] - 10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [10:41:14] !log install updated linux-image-4.9.0-9-amd64 on ms-be hosts [10:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] (03PS1) 10Ema: prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) [10:43:58] (03CR) 10jerkins-bot: [V: 04-1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:45:56] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:46:41] (03CR) 10jerkins-bot: [V: 04-1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:47:40] 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (10fgiunchedi) [10:49:09] (03PS2) 10Ema: prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) [10:53:45] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [10:53:49] !log re-enabled icinga1001 in meta monitoring [10:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:07] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units 
failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [10:55:49] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond) p:05Triage→03Normal [10:59:39] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1189 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1100). [11:00:04] matthiasmullie and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] o/ [11:00:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond) during my research i noticed that puppet db was failing with the following error ` Compiling catalog for achernar.wikimed... [11:00:48] o/ [11:00:59] dcausse: mine will take a little longer [11:01:11] let's do yours first [11:01:25] RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:01:27] matthiasmullie: sure thanks [11:01:27] want to deploy yourself, or want me to do it? 
[11:01:39] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [11:01:46] (03PS2) 10Jbond: puppet_compiler: Add checks for missing facts files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/523709 (https://phabricator.wikimedia.org/T228266) [11:01:49] matthiasmullie: I can deploy [11:02:07] okay [11:02:18] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse) [11:03:17] (03Merged) 10jenkins-bot: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse) [11:03:32] (03CR) 10jenkins-bot: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse) [11:05:16] (03PS1) 10Vgutierrez: Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) [11:06:17] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond) [11:08:29] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T227136: [cirrus] switch search traffic (except completion) to codfw (duration: 00m 54s) [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:37] T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136 [11:10:03] RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops [11:10:34] Urbanecm Hello :-) Please ping me when you're ready & deployments are over, thanks [11:11:21] matthiasmullie: I'm done [11:11:48] k, thanks [11:12:16] Daimona: I'll let you know once I'm done, but will need a full scap, will take some time [11:12:42] Yay, no hurry, thanks [11:15:55] ugh, full scap :-) [11:16:01] slooooooow [11:16:02] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#5339132, @greg wrote: >>>! In T211881#5332195, @akosiaris wrote: >> the hardwar... [11:16:24] !log reindexing wikidata (elastic@eqiad) T227136 [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:31] T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136 [11:22:49] PROBLEM - Keyholder SSH agent on icinga1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [11:22:57] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [11:23:15] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Hiera incompatible with newer versions of puppet - https://phabricator.wikimedia.org/T227779 (10jbond) 05Open→03Resolved [11:23:21] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond) [11:25:48] matthiasmullie, why does https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/523879 need a full scap? [11:26:09] RECOVERY - Keyholder SSH agent on icinga1001 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [11:26:39] scap sync-file /srv/mediawiki-staging/php-1.34.0-wmf.14/extensions/WikibaseMediaInfo should be enough [11:26:46] unless I'm overlooking something [11:27:02] hrm [11:27:12] for some reason, I thought it didn't take directories [11:27:13] you're right [11:27:13] and just in time, was about to scap :p [11:27:40] full scap is required for a) i18n/namespace changes b) new directories added [11:29:44] I suppose b) got me confused over directories :p [11:29:47] TIL! [11:29:48] thanks [11:30:04] syncing now [11:30:04] !log mlitn@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/WikibaseMediaInfo: [WikibaseMediaInfo] Revert "Add Wikidata links to statement UI elements" (duration: 00m 56s) [11:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:52] Daimona: I'm done [11:30:56] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Urbanecm) >>! In T153068#5340418, @Reedy wrote: > Mounting it where though? Active maintenance... 
[11:31:35] Thanks [11:31:47] I still need 10 minutes then I'm ready to start [11:32:23] Daimona, I'll be back in about 20 mins, will ping you [11:35:27] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [11:36:14] (03PS1) 10Jbond: puppetmaster1003: convert puppetmaster1003 from spare top puppetmaster::backend [puppet] - 10https://gerrit.wikimedia.org/r/523907 (https://phabricator.wikimedia.org/T201342) [11:37:04] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: convert puppetmaster1003 from spare top puppetmaster::backend [puppet] - 10https://gerrit.wikimedia.org/r/523907 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond) [11:38:33] 10Operations, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) it seems that container synchronization is broken a... [11:40:15] tgr: thanks re. MassMessage :) [11:40:38] slowly learning on my (few) free time [11:45:21] Daimona, I'm back [11:45:29] Ready [11:45:35] cool! [11:45:38] So we can start? [11:45:40] Sure [11:45:56] Alright! [11:46:04] So, first of all I'd like to see another dry run [11:46:20] sure [11:46:27] Since there've been some on-wiki changes [11:46:59] PROBLEM - puppet last run on puppetmaster1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[confd],Group[gitpuppet] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [11:47:26] Running foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run [11:47:31] Thanks [11:47:39] Are you from mwmaint1002? [11:48:12] yes, running from that host [11:48:29] why do you ask? [11:48:33] Great, then I'm filtering for it on logstash [11:48:41] aha! 
[11:48:56] To ensure nothing wrong, although we shouldn't have problems [11:48:59] additionally we could have Daimona on the deployment group as well /mehides [11:49:24] hauskatze: I almost never need to deploy stuff :-) [11:49:55] Daimona, or restricted (that's mwlog1001, mwmaint1002 etc) [11:50:05] (03PS1) 10Effie Mouzeli: WIP profile::mediawiki::jobrunner: eature flags [puppet] - 10https://gerrit.wikimedia.org/r/523908 [11:50:53] (03CR) 10Effie Mouzeli: [C: 04-2] "This should be merged after we have enabled the use of feature flags on jobrunners (523908)" [puppet] - 10https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [11:51:48] Daimona, currently on iewiki [11:52:19] Alright, we'll wait :) [11:52:57] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [11:53:27] Meh that error on labtestwiki again [11:54:16] i wouldn't say that's an issue [11:54:29] labtestwiki is even inaccessible for the public [11:54:36] Yeah indeed [11:54:40] Just some logspam [11:54:42] yup [11:56:21] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs1004. [puppet] - 10https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) [11:57:06] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs1004. 
[puppet] - 10https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [11:58:09] Daimona, https://phabricator.wikimedia.org/P8759 [11:58:20] Thanks, gonna filter and diff [11:58:29] great [11:58:32] ping me once you're ready [11:58:50] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [11:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1200) [12:00:11] Alright, same as last time, - cawikinews fixed on-wiki [12:00:26] So, ready to move on [12:00:36] I'd like to see it on cawiki only first [12:00:42] Just plz gimme a moment to open it [12:01:28] OK, ready for cawiki [12:02:07] (ping Urbanecm) [12:02:39] Daimona, ok, running on cawiki [12:03:44] Daimona, it said "Throttle parameters successfully normalized. Changed 2 rows." [12:04:00] Yep https://ca.wikipedia.org/wiki/Especial:Filtre_d%27abuses/history/9/diff/prev/164 [12:04:15] Lemme check the afh table from quarry just to be sure [12:04:25] no idea if it's good, but looks so :) [12:04:27] Uh actually, it's not on quarry [12:04:38] Daimona, you can write your query here, I can run it for you [12:04:56] Uhm let's see [12:05:15] SELECT * FROM abuse_filter_history WHERE afh_id = 164 [12:05:21] Can be posted publicly because the filter is public [12:06:00] Daimona, running [12:06:05] Ty [12:06:31] Daimona, https://phabricator.wikimedia.org/P8761 [12:06:48] Of note, the script removed "user," instead of just the comma, but I guess I wrote it like that just to keep previous behaviour. I'll have to write an on-wiki notice [12:07:08] Ok [12:07:52] Yeah, it's fine [12:08:00] Wonderful :) [12:08:06] Now viwiki alone [12:08:08] doing [12:08:52] Daimona, https://phabricator.wikimedia.org/P8762 [12:09:43] Uhm [12:10:15] what's happening? 
[12:10:28] Seems like no changes were made, but maybe I just opened the page too late [12:10:36] Lemme check the source [12:10:55] ok [12:11:26] !log Ran extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php for cawiki and viwiki (T209565) [12:11:28] Could you please run: SELECT * FROM abuse_filter_history WHERE afh_id = 48 [12:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:33] T209565: Dry run for normalizeThrottleParameters.php - https://phabricator.wikimedia.org/T209565 [12:11:34] certainly [12:11:37] Ty [12:11:51] I think it's fine, we don't beautify groups for old rows I guess [12:12:35] Only if they're empty, plus it added explicit 0s in the other params, so I believe it's working as intended [12:13:09] Daimona, https://phabricator.wikimedia.org/P8763 (WMF-NDA only paste, that filter looks non-public) [12:13:23] Yeah, thanks it's indeed private, forgot to say that [12:14:09] OK as I suspected, only 0s were added, which is fine [12:14:22] good [12:14:25] So... Let's unleash that little boy on all wikis! [12:14:47] doing! [12:15:12] (03PS1) 10Jbond: standard::base - reorder: Ensure admin runs early [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) [12:15:39] !log Running foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php in tmux session on mwmaint1002 (T209565) [12:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:54] Daimona, would you need the queries for changed history rows? [12:21:26] I don't think we do, the ones we got looked promising [12:21:48] ok [12:21:59] Did it complete? 
[12:22:02] not yet [12:22:07] ruwiktionary [12:22:22] Great [12:23:18] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like another case of hw raid controller lockup, I've rebooted and upgraded the controller firmware. Host came back normal! [12:24:51] Daimona, we're done! [12:24:58] Cool [12:25:00] https://phabricator.wikimedia.org/P8764 [12:25:25] Thanks, now checking [12:25:29] ok [12:25:35] let me know if you need anything [12:27:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond) [12:33:41] (03CR) 10Ema: [C: 03+2] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [12:33:49] (03PS3) 10Ema: prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) [12:34:04] Daimona, fyi, linked the outputs on the task ftr [12:34:33] Yeah thanks [12:34:44] I just finished sample-checking some wikis, and everything looks great! [12:35:04] So well, I'll just go ahead and resolve a few tasks [12:35:14] Thanks a lot for your help! [12:36:27] !log upgrade hp raid firmware on ms-be1 hosts - T141756 [12:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:34] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [12:36:52] happy to help Daimona! [12:45:19] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) **Status** There is now a Javamelody prometheus exporter at https://gerrit.wikimedia.or... 
[12:45:26] (03PS2) 10Jbond: standard::base - reorder: Ensure admin runs early [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) [12:45:29] (03PS1) 10Jbond: standard: remove has_admin global variable [puppet] - 10https://gerrit.wikimedia.org/r/523914 [12:46:24] (03CR) 10Gehel: [C: 03+2] Skipping download if PBF file exists [puppet] - 10https://gerrit.wikimedia.org/r/523718 (owner: 10MSantos) [12:47:37] (03PS2) 10Gehel: Skipping download if PBF file exists [puppet] - 10https://gerrit.wikimedia.org/r/523718 (owner: 10MSantos) [12:47:41] (03CR) 10Jbond: [C: 03+2] standard::base - reorder: Ensure admin runs early (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond) [12:47:55] anomie: please let me know when you are around, I'd like to merge https://gerrit.wikimedia.org/r/#/c/493323/ and then ask you to validate that things look good, let me know when it is a good time to do that [12:48:08] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523876 (owner: 10Muehlenhoff) [12:49:19] (03PS2) 10Gehel: maps: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523876 (owner: 10Muehlenhoff) [12:49:24] (03PS3) 10Jbond: standard::base - reorder: Ensure admin runs early [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) [12:49:55] (03PS2) 10Alexandros Kosiaris: Switch ORES pool counters for codfw to 2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [12:50:08] godog: I'm around now, but about to have some meetings. A good time for me would probably start in 2 hours and 15 minutes or so. 
[12:50:13] (03CR) 10Gehel: [C: 03+2] maps: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523876 (owner: 10Muehlenhoff) [12:51:57] (03PS3) 10Alexandros Kosiaris: Switch ORES pool counters for codfw to 2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [12:52:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff) [12:52:57] anomie: sounds good, I have a meeting in three hours, ping me after your meetings and we can do it [12:54:05] so for T228250 """PHP Notice: Undefined property: stdClass::$module in OATHAuth/src/OATHUserRepository.php on line 193""" [12:54:05] T228250: PHP Notice: Undefined property: stdClass::$module in OATHAuth/src/OATHUserRepository.php on line 193 - https://phabricator.wikimedia.org/T228250 [12:54:12] that seems to be solely for translatewiki.net [12:54:30] according to the task, the cause is a database change in OATHAuth extension https://phabricator.wikimedia.org/rEOATea984e5c2b2edd24f00c90766d640a65aafb75fa [12:54:31] (03PS4) 10Jbond: standard::base - reorder: Ensure admin runs early [puppet] - 10https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) [12:54:36] which got merged / included in 1.34.0-wmf.11 [12:54:47] so if we had the issue on wmf production we would surely have the same error [12:54:57] or would have noticed (since the task claims that users are unable to login) [12:55:33] https://phabricator.wikimedia.org/T225643 hints at a database schema change that occurred on oauthauth_users table to add columns 'module' and 'data' [12:55:40] so I guess WMF prod is covered and working fine [12:55:52] == it is not a blocker to the train ;-] [12:55:54] liw: ^^ [12:55:57] public summary! 
[12:56:03] ack, thanks hashar [12:56:15] the train deployment window is opening in a couple of minutes [12:57:12] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [12:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] liw: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1300). [13:03:05] dcausse: hi, in case you are around we found some warning with CirrusSearch :-\ [13:03:06] PHP Warning: Attempted to serialize unserializable builtin class Closure$CirrusSearch\Profile\CompletionSearchProfileRepository::__construct;3047 [13:03:41] task being filed [13:04:29] hashar: thanks, looking [13:04:36] dcausse: repro https://www.mediawiki.org//w/api.php?action=query&format=json&formatversion=2&prop=extracts%7Cpageimages%7Cdescription%7Cpageprops&generator=search&gsrlimit=3&gsrprop=redirecttitle&gsrsearch=morelike%3AWikimedia%20Apps%2FiOS%20FAQ%2Fja&gsrwhat=text&exchars=256&exintro=&exlimit=3&explaintext=&pilicense=any&pilimit=3&piprop=thumbnail&pithumbsize=120 [13:05:10] dcausse: and there is a second code path causing the issue [13:06:47] !log prometheus servers: remove varnish-upload_$dc_backend.yaml, replaced by ATS equivalent T227668 [13:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:55] T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 [13:07:47] 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10fgiunchedi) [13:09:28] dcausse, hashar: https://phabricator.wikimedia.org/T228276 is the ticket I just filed for this [13:09:54] liw: thanks I'm on it [13:10:46] dcausse, thanks!
[13:10:53] !log cp-codfw: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672 [13:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:01] T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672 [13:11:30] seems that breaks API queries when the generator is the search system [13:11:34] or something like that [13:13:01] one of the errors had as referrer https://cho.m.wikipedia.org/wiki/Hattak [13:18:36] dcausse: fun, php7.2 does throw an exception "Serialization of 'Closure' is not allowed" [13:18:52] slightly different message :] [13:20:53] * liw is entirely out of his depth trying to understand this stuff, so treats anything as a blocker [13:21:15] (03PS5) 10Muehlenhoff: Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 [13:26:20] !log disabled puppet on Icinga hosts in preparation of adding the LDAP replicas/codfw to LVS [13:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:22] (03CR) 10Muehlenhoff: [C: 03+2] Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 (owner: 10Muehlenhoff) [13:28:08] (03PS2) 10BBlack: Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [13:28:10] (03PS1) 10BBlack: Add domain root addrs for ncredir [dns] - 10https://gerrit.wikimedia.org/r/523924 (https://phabricator.wikimedia.org/T133548) [13:28:42] (03CR) 10BBlack: [C: 03+1] Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [13:33:34] dcausse: I guess I can just +2 your change :) [13:33:52] hashar: please :) [13:35:15] (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs2005.
[puppet] - 10https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) [13:35:18] progress! [13:35:48] (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs2005. [puppet] - 10https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [13:37:59] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:01] PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.252:389, 208.80.153.252:636]) https://wikitech.wikimedia.org/wiki/PyBal [13:46:51] PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 10 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:48:05] PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.252:389, 208.80.153.252:636]) https://wikitech.wikimedia.org/wiki/PyBal [13:49:47] PROBLEM - PyBal connections to etcd on lvs2005 is CRITICAL: CRITICAL: 10 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:50:26] seems to be ldap-ro.codfw.wikimedia.org. 
[13:50:33] moritzm: --^ [13:50:38] pretty sure it is ok [13:50:43] just wanted to triple check [13:53:58] yeah, I'd expect that's the effect of the new endpoints being available, but pybal not yet restarted [13:55:43] (03PS1) 10Ema: restbase: add TLS support via tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) [13:55:50] moritzm: indeed, all good [13:57:25] 10Operations, 10ops-codfw, 10DBA: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10Marostegui) 05Open→03Declined Going to close this ticket as I have created the decommission one: {T228281} [13:57:29] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [13:57:40] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [13:57:49] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [13:58:44] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [14:00:33] (03PS1) 10Ema: secret: dummy key for restbase [labs/private] - 10https://gerrit.wikimedia.org/r/523929 (https://phabricator.wikimedia.org/T210411) [14:02:13] !log liw@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CirrusSearch/includes/Searcher.php: Do not serialize ResultsType instance T228276 (duration: 00m 55s) [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:22] T228276: PHP Warning: Attempted to serialize unserializable builtin class Closure$CirrusSearch\Profile\CompletionSearchProfileRepository::__construct;2912 - https://phabricator.wikimedia.org/T228276 [14:03:50] dcausse: solved! 
( it works: https://www.mediawiki.org/w/api.php?action=query&format=json&formatversion=2&prop=extracts%7Cpageimages%7Cdescription%7Cpageprops&generator=search&gsrlimit=3&gsrprop=redirecttitle&gsrsearch=morelike%3AWikimedia%20Apps%2FiOS%20FAQ&gsrwhat=text&exchars=256&exintro=&exlimit=3&explaintext=&pilicense=any&pilimit=3&piprop=thumbnail&pithumbsize=120 ) [14:03:53] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for restbase [labs/private] - 10https://gerrit.wikimedia.org/r/523929 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:04:18] (03PS1) 10Fsero: swift: enable logging for container-sync-to-sync [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) [14:04:41] hashar: thanks! [14:05:04] liw: sorry about that! [14:05:48] jouncebot: next [14:05:48] In 1 hour(s) and 54 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600) [14:05:50] :] [14:05:53] jouncebot: now [14:05:53] For the next 0 hour(s) and 54 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1300) [14:06:15] (03PS1) 10Lars Wirzenius: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523931 [14:06:18] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523931 (owner: 10Lars Wirzenius) [14:07:15] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523931 (owner: 10Lars Wirzenius) [14:07:29] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523931 (owner: 10Lars Wirzenius) [14:09:07] !log restarting pybal on backup LVSes in codfw [14:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:45] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 
1.34.0-wmf.14 [14:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:40] !log liw@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.14 (duration: 00m 54s) [14:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:21] RECOVERY - PyBal connections to etcd on lvs2005 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:12:55] (03CR) 10Filippo Giunchedi: swift: enable logging for container-sync-to-sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:13:44] (03CR) 10Filippo Giunchedi: swift: enable logging for container-sync-to-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:14:19] RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:14:29] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:23:07] dcausse, no worried, thanks for the quick fix [14:24:35] (03PS2) 10Fsero: swift: enable logging for container-sync-to-sync [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) [14:25:22] (03CR) 10Fsero: swift: enable logging for container-sync-to-sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:26:13] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:27:16] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now 
stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10fsero) as long @RStallman-legalteam comes back with a positive result, the clinic duty person will move this forward (thi... [14:27:42] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10fsero) p:05Triage→03Normal [14:29:08] and now there's a bunch of other new error messages in logstach [14:29:14] (03PS2) 10Fsero: Add accraze to deployment and deploy-service groups [puppet] - 10https://gerrit.wikimedia.org/r/523778 (https://phabricator.wikimedia.org/T228191) (owner: 10Halfak) [14:30:20] (03CR) 10Fsero: [C: 03+2] Add accraze to deployment and deploy-service groups [puppet] - 10https://gerrit.wikimedia.org/r/523778 (https://phabricator.wikimedia.org/T228191) (owner: 10Halfak) [14:30:54] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10RStallman-legalteam) The NDA is signed. Fine to move forward. Thanks! [14:30:59] !log repool maps1004 - T218097 [14:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:07] T218097: [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service - https://phabricator.wikimedia.org/T218097 [14:31:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! see nitpick inline for rsyslog and commit message" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:31:30] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Add accraze to deployment and deploy-service groups. - https://phabricator.wikimedia.org/T228191 (10fsero) done. 
@Halfak thanks for the patch [14:31:50] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Add accraze to deployment and deploy-service groups. - https://phabricator.wikimedia.org/T228191 (10fsero) 05Open→03Resolved p:05Triage→03Normal [14:32:04] dcausse, would "PHP Fatal Error from line 21 of /srv/mediawiki/php-1.34.0-wmf.14/extensions/CirrusSearch/includes/ElasticaErrorHandler.php: Object of class Elastica\Response could not be converted to string" also fall in your wheelhouse? [14:32:34] !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init [14:32:39] liw: yes I think so, looking [14:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:30] filing task [14:34:33] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:34:48] (03PS3) 10Fsero: swift: enable logging for container synchronization-to-synchronization [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) [14:34:58] dcausse, https://phabricator.wikimedia.org/T228283 [14:35:06] !log restart pybal on lvs2002 (codfw primary) T227778 [14:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:13] T227778: Create an LDAP replica in codfw (using LVS) - https://phabricator.wikimedia.org/T227778 [14:35:22] (03CR) 10Fsero: swift: enable logging for container synchronization-to-synchronization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:35:57] (03CR) 10Fsero: [C: 03+2] swift: enable logging for container synchronization-to-synchronization [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: 10Fsero) [14:36:07] (03PS4) 10Fsero: swift: 
enable logging for container synchronization-to-synchronization [puppet] - 10https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) [14:37:39] RECOVERY - PyBal connections to etcd on lvs2002 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:38:39] (03PS1) 10Ottomata: Set cloudvirtan* to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) [14:39:07] RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:39:39] (03CR) 10Ottomata: [C: 03+2] Set cloudvirtan* to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:39:47] (03PS2) 10Ottomata: Set cloudvirtan* to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) [14:39:56] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set cloudvirtan* to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:40:23] (03CR) 10Elukey: "John I have a question for you if you have time. 
This morning while reviewing this change I recalled that undef values in erb do not alway" [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [14:41:00] !log otto@cumin1001 START - Cookbook sre.hosts.decommission [14:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:23] !log otto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:31] (03PS1) 10Alexandros Kosiaris: Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) [14:41:50] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by otto@cumin1001 for hosts: `cloudvirtan[10... 
[14:43:51] (03PS6) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [14:45:11] (03PS2) 10Alexandros Kosiaris: Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) [14:45:25] (03CR) 10Muehlenhoff: [C: 03+1] Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) (owner: 10Alexandros Kosiaris) [14:45:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) (owner: 10Alexandros Kosiaris) [14:45:41] !log enabling container-sync logging T228196 [14:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:48] T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 [14:46:17] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:48:07] https://phabricator.wikimedia.org/T228286 - another blocker filed: LocalFile.php: Call to a member function getName() on a non-object (null) [14:48:24] (03PS1) 10Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) [14:50:23] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: 10Marostegui) [14:50:44] (03CR) 10Marostegui: [C: 04-2] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1001/17433/" [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: 10Marostegui) [14:51:05] 10Operations, 10Release-Engineering-Team-TODO, 
10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10akosiaris) 05Open→03Resolved a:03akosiaris User has been ad... [14:53:24] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10Papaul) a:05Papaul→03Marostegui Replaced with a used one. [14:55:14] !log updated jenkins in thirdparty/ci (stretch) and thirdparty (jessie) to 2.176.2 (T228142) [14:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:20] (03PS1) 10Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) [14:56:28] (03PS2) 10Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) [14:56:32] (03PS11) 10CDanis: WIP dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523013 [14:56:33] the mediawiki-new-errors dashboard on logstash is shown about 18 new errors now, mostly database, local storage, or swift - anyone around who can take a look? [14:56:34] (03PS1) 10CDanis: WIP WIP broken dbctl: schemata [puppet] - 10https://gerrit.wikimedia.org/r/523943 [14:59:52] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10Marostegui) Thanks - I can see it rebuilding: ` physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding) ` [15:00:08] !log poweroff ms-be2022 - T227667 [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:14] T227667: ms-be2022 misbehaving / error on boot - https://phabricator.wikimedia.org/T227667 [15:01:09] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) 05Open→03Resolved @Gehel I checked this server again today, all looks good. 
Resolving this task for now. We can reopen it anytime. thanks. [15:03:17] !log Depool mw2269 to reboot it - T227548 [15:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:25] T227548: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548 [15:03:38] godog: I'm ready for https://gerrit.wikimedia.org/r/#/c/493323/ if you are [15:03:54] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Reporting some info from https://github.com/ROCmSoftwarePlatfo... [15:04:32] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Alright, nodes are role spare::system and decommed/downtimed in icinga. [15:04:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [15:05:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:05:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:07] !log shutting down ms-be2022 for HW troubleshooting [15:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:38] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. 
- https://phabricator.wikimedia.org/T225128 (10Ottomata) @cmjohnson back atcha :) [15:07:07] jouncebot: now [15:07:07] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [15:07:09] jouncebot: next [15:07:09] In 0 hour(s) and 52 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600) [15:07:21] !log upgrading CI Jenkins # T228142 [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:12] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:08:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nicely done. I would have given up on providing default values, thanks for persevering" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [15:10:00] PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:10:17] (03PS1) 10Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 [15:10:53] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:08] (03CR) 10jerkins-bot: [V: 04-1] proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 (owner: 10Cwhite) [15:11:35] I am waiting for some jobs to complete [15:11:44] PROBLEM - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:11:55] !log shutting 
down mw2250 for disk replacement [15:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] PROBLEM - High lag on wdqs2005 is CRITICAL: 5631 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:12:34] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:13:04] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:13:04] ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:13:30] PROBLEM - Host mw2250 is DOWN: PING CRITICAL - Packet loss = 100% [15:14:36] PROBLEM - Host ms-be2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:49] !log restarting swift-container-sync on ms-be* for getting logging configuration T228196 [15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:56] T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 [15:15:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> This morning while reviewing this change I recalled that undef values in erb do not always correspond to false, but I might misremember." 
[puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [15:18:46] RECOVERY - Host mw2250 is UP: PING WARNING - Packet loss = 93%, RTA = 36.15 ms [15:18:58] PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:19:02] PROBLEM - Nginx local proxy to videoscaler on mw2250 is CRITICAL: connect to address 10.192.0.76 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [15:19:30] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:19:31] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Jul-Sep-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) @Pruem ^^^ :) [15:20:02] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Papaul) a:05Papaul→03MoritzMuehlenhoff Replaced both 500GB disks with 250GB disks . 
All your's for re-imaging [15:20:49] (03PS1) 10CDanis: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 [15:21:32] PROBLEM - Host mw2250 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:08] RECOVERY - jenkins_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:23:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond) [15:25:00] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:25:08] RECOVERY - Host ms-be2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [15:25:43] (03CR) 10Daimona Eaytoy: [C: 04-1] "(The rest still has to be sorted out)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [15:25:58] PROBLEM - Host mw2269 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:47] ^ dowtime expired [15:27:38] * Urbanecm stagging on mwdebug [15:29:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] Capture calico deployment in code. [deployment-charts] - 10https://gerrit.wikimedia.org/r/523580 (https://phabricator.wikimedia.org/T227775) (owner: 10Fsero) [15:30:23] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis) [15:31:12] ACKNOWLEDGEMENT - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. 
Failed resources (up to 3 shown): amusso apt broken due to python upgrade which triggers a replacement of zuul embedded python https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:31:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of minor comments inline, plus a question of whether we want to ship own own coredns chart under releases.wikimedia.org/charts or n" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [15:32:12] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[jenkins],Package[zuul] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:32:40] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2022 misbehaving / error on boot - https://phabricator.wikimedia.org/T227667 (10Papaul) a:05Papaul→03fgiunchedi Power drain, reboot the sever 3 times no more errors. @fgiunchedi please feel free to double check and resolve task. Thanks. 
[15:33:23] (03CR) 10Effie Mouzeli: profile:service_proxy: Add more hiera variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [15:35:58] RECOVERY - Host mw2269 is UP: PING OK - Packet loss = 0%, RTA = 38.11 ms [15:36:04] (03CR) 10Effie Mouzeli: [C: 03+2] profile:service_proxy: Add more hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [15:36:12] (03PS5) 10Effie Mouzeli: profile:service_proxy: Add more hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) [15:37:36] !log Deployed patch for T207094 T228284 to wmf.13 and wmf.14 [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:43] T228284: SpecialCheckUser: Call to a member function userCan() on a non-object (null) - https://phabricator.wikimedia.org/T228284 [15:39:44] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: PCC always has an ERROR when compiling for servers with profile::redis::slave - https://phabricator.wikimedia.org/T228266 (10jbond) [15:40:32] PROBLEM - Host mw2269 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:41] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10akosiaris) >>! In T224794#5339362, @wiki_willy wrote: > @akosiaris or @Volans - we can order drive replacements for this, since it's out of warranty, but I was trying to figure out how this correlates with the new... [15:40:44] RECOVERY - Host mw2269 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [15:41:23] (03CR) 10Fsero: "regarding the chart, i don't mind publishing it but this chart i do see it something pretty specific and internal of the use case." 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [15:41:24] 10Operations, 10ops-codfw: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548 (10Papaul) a:05Papaul→03jijiki Power drain, SSH to mgmt is back working @jijiki Please feel free to repool server Thanks [15:42:09] 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10ayounsi) Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possible to change the CNAMEs instead? [15:42:19] I just got 15:41:23 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'php-1.34.0-wmf.14', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on [15:42:19] mw2269.codfw.wmnet returned [255]: ssh: connect to host mw2269.codfw.wmnet port 22: Connection timed out while emergency-deploying [15:42:29] (03PS2) 10CDanis: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 [15:42:31] (03PS1) 10CDanis: dbctl: part 2/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523950 [15:42:50] PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:43:14] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) we will be replacing lvs2006 with lvs2010 [15:43:34] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) 
lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) p:05High→03Lowest [15:43:35] (03CR) 10CDanis: [C: 03+2] dbctl: part 1/2 to bring schema in line with production (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis) [15:43:36] (03CR) 10Fsero: [V: 03+2 C: 03+2] Termbox Staging - Change to internal docker repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/523771 (owner: 10Tarrow) [15:44:54] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1145 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:45:50] Urbanecm: hosts seems to have crashed ~5m ago [15:46:06] it's back up right now, we 'll have to investigate a bit what happened [15:46:10] akosiaris, thanks. Do I need to do anything (re-sync?) or will it be taken care by someone else? [15:46:27] I think you should resync just to be on the safe side [15:46:32] will do [15:46:59] !log Re-syncing patch for T207094 T228284 and wmf.14 [15:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:06] T228284: SpecialCheckUser: Call to a member function userCan() on a non-object (null) - https://phabricator.wikimedia.org/T228284 [15:47:08] (03Merged) 10jenkins-bot: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis) [15:47:13] (03CR) 10Volans: [C: 03+2] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/523950 (owner: 10CDanis) [15:47:21] thank you again akosiaris [15:47:35] Urbanecm: thanks as well [15:47:44] sync completed with no errors [15:48:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: PCC always has an ERROR when compiling for servers with profile::redis::slave - 
https://phabricator.wikimedia.org/T228266 (10jbond) Investigating further this is due to how `populate_puppetdb` adds entries to the datab... [15:48:54] 10Operations, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) 05Open→03Resolved a:03Vgutierrez >>! In T203194#5308402, @MoritzMuehlenhoff wrote: > @Vgutierrez The firmware update on the NICs fixed this for good, right? Can we clos... [15:49:56] (03Merged) 10jenkins-bot: dbctl: part 2/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523950 (owner: 10CDanis) [15:50:31] 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Papaul) I checked the system log, no memory errors or temperature warnings but found out that the server firmware is very old. We can depool the server if possible and I can upgrade the f... [15:51:49] (03PS1) 10Aklapper: Phab: Allow viewing ogg video files inline (instead of downloading) [puppet] - 10https://gerrit.wikimedia.org/r/523952 (https://phabricator.wikimedia.org/T228225) [15:54:39] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 790.3 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:57:23] (03PS1) 10Effie Mouzeli: hieradata: Set connect_timeout for cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063) [15:57:26] (03PS1) 10Ema: restbase: add certificate for restbase.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/523956 (https://phabricator.wikimedia.org/T210411) [15:58:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [16:00:01] 10Operations: mw2269 rebooted/crashed unexpectedly on Jul 
17th ~15:30UTC - https://phabricator.wikimedia.org/T228296 (10akosiaris) p:05Triage→03Normal [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:26] (03PS1) 10Filippo Giunchedi: wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) [16:00:30] (03CR) 10EBernhardson: [C: 03+1] hieradata: Set connect_timeout for cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli) [16:00:38] * Urbanecm has some config patches [16:00:45] (03Abandoned) 10EBernhardson: Increase services proxy connect timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/523194 (https://phabricator.wikimedia.org/T228063) (owner: 10EBernhardson) [16:00:51] (03PS2) 10Filippo Giunchedi: wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) [16:00:54] (03PS2) 10Urbanecm: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) [16:00:59] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm) [16:01:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [16:01:22] 10Operations, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10fgiunchedi) >>! 
In T228275#5341475, @ayounsi wrote: > Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possibl... [16:01:47] !log copy confd package from stretch-wikimedia to buster-wikimedia [16:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:13] (03Merged) 10jenkins-bot: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm) [16:03:28] (03CR) 10jenkins-bot: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm) [16:03:38] (03PS1) 10CDanis: bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958 [16:04:12] (03PS2) 10Urbanecm: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) [16:04:18] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm) [16:05:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:523686|Enable partial blocks on dewiki]] (T228150) (duration: 00m 54s) [16:05:57] RECOVERY - puppet last run on puppetmaster1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:01] T228150: Enable partial blocks on the German Wikipedia - https://phabricator.wikimedia.org/T228150 [16:06:08] (03Merged) 10jenkins-bot: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm) [16:07:27] !log 
powering off cloudvirt1014 for rack move T226188 [16:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:34] T226188: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 [16:07:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Raise zh_classicalwiki requirement for autoconfirmed (T228141) (duration: 00m 55s) [16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:06] T228141: Change Autoconfirmed users' age and number of edits at zh-classical wiki - https://phabricator.wikimedia.org/T228141 [16:08:09] !log Morning SWAT done [16:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:02] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) Indeed the server is not showing the Smart Storage Battery status. Lets try to upgrade the server firmware since the last upgrade was from 2015. @fgiunchedi Let me know when we can de... [16:11:10] (03CR) 10CDanis: [C: 03+2] bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958 (owner: 10CDanis) [16:11:16] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) {F29791228} [16:11:52] PROBLEM - Host cloudvirt1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:12:24] ^ paged [16:12:25] paged [16:12:25] cmjohnson1: &&& [16:12:32] :-/ [16:12:36] s/&/^/ [16:12:44] ah rack move, I see [16:12:49] RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [16:13:51] (03CR) 10jenkins-bot: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm) [16:14:02] (03Merged) 10jenkins-bot: bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958 (owner: 10CDanis) [16:14:47] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:17:02] ^ FPC 5 PEM 1 is not powered [16:17:10] ? [16:17:36] RECOVERY - Host cloudvirt1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [16:18:16] Host cloudvirt1014.mgmt is DOWN paging is a known issue: https://phabricator.wikimedia.org/T223458 [16:18:31] PEM 1 is the power supply? 
[16:18:48] 10Operations, 10ops-eqdfw, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10wiki_willy) a:03Cmjohnson [16:19:22] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10wiki_willy) [16:19:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [16:19:58] !log Depool mw2181 - T205240 [16:20:00] In english terms: I think that means we lost one of the redundant power inputs to one top of rack switch [16:20:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [16:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:05] T205240: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 [16:20:42] yeah correct, lost redundant power [16:20:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO (201907): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10greg) [16:20:50] FPC5 means row 5 [16:20:53] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:20:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:10] thanks :) 
[16:21:28] (PS1) Alexandros Kosiaris: Don't page on mgmt failures [puppet] - https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) [16:21:32] cloudvirt1014 is in that same rack [16:21:35] Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (wiki_willy) @Cmjohnson - not sure if there's a loose connection somewhere on backup1001, but can you check it out when you have a few cycles? This one needs to be up and runni... [16:21:38] Just wanted to know if the switch was down or only with one power input [16:21:51] the switch is probably up or there'd be more alerts, I think [16:22:01] yes definitely [16:22:04] https://netbox.wikimedia.org/dcim/racks/13/ [16:22:29] (CR) Andrew Bogott: [C: +1] Don't page on mgmt failures [puppet] - https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: Alexandros Kosiaris) [16:22:33] (for a list of hosts in the same rack as the switch with the PEM fail) [16:23:20] oh an even better URI for that: https://netbox.wikimedia.org/dcim/devices/?rack_id=13 [16:24:48] !log shutting down mw2181 for firmware upgrade [16:24:54] bblack: the info about B5 was in the switch's logs or somewhere else? (trying to understand how to read those alarms) [16:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:14] There is "powering off cloudvirt1014 for rack move T226188" from cmjohnson1, Chris could you check if the power cables for asw2-b5-eqiad are properly seated [16:25:15] T226188: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 [16:25:51] ah!
[16:26:05] (in meeting, will follow up after) [16:26:42] Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Reedy) I would imagine we’re not going to be mounting a labs NFS host into a production host... [16:26:56] elukey: yeah we could probably stand to make some improvements in the alerting and UIs there... [16:27:21] (PS3) Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) [16:27:51] elukey: asw2-b-eqiad> show system alarms [16:27:51] 2019-07-17 16:11:01 UTC Major FPC 5 PEM 1 is not powered [16:28:24] XioNoX check now [16:28:36] XioNoX: ah so "FPC5 means row 5" is "rack 5" right? [16:28:38] that info isn't exposed over SNMP, so alerting would need to ssh to the device to run that command [16:28:43] for the uninitiated and/or without logging into network hardware, it is a bit of hoop jumping to follow that icinga switch alert down to a cause and a correlated physical rack location [16:28:52] rack 5, yeah :) [16:29:01] ah ok now it is clear, I was a bit confused :D [16:29:07] elukey: it does in this case, but I'm not sure it's a universal constant that FPC# == row#? [16:29:11] power cable on the pdu was loose [16:29:36] FPC# definitely correlates to the first number of interface naming when you look elsewhere though [16:29:43] (CR) Elukey: [C: +2] aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [16:29:47] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:29:48] bblack: nah it's not, it's by convention. But at least now this is tracked in Netbox [16:30:07] asw2-b-eqiad FPC5 == interface ports named xe-5/x/y or ge-5/x/y on asw2-b-eqiad for sure [16:30:45] ahhh so PEM is Power Entry Module, so many acronyms to learn :D [16:30:47] yep! [16:31:01] for most hardware, it's pretty trivial (manually or with links) to go from a hostname to the enclosing rack and so on [16:31:25] Operations, Gerrit, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) I have quickly talked with @Paladox about it. He has tried the `metrics-reporter-prometh... [16:31:27] the switch stacks are kind of a special case, where it's not reliably trivial or easy [16:31:39] https://netbox.wikimedia.org/dcim/devices/1276/ see virtual chassis-> position [16:31:43] the failure is just for asw2-b-eqiad in icinga terms [16:32:22] figuring out it's FPC 5, and that FPC5 == Rack 5, is a bit challenging [16:32:33] yes that part I wanted/want to learn :) [16:32:38] jouncebox: now [16:32:43] even with that link, nothing's explicitly saying FPC5 == Rack 5's TOR switch [16:32:44] it seems that I have a lot of info to work on now :) [16:33:17] jouncebot: now [16:33:17] For the next 0 hour(s) and 26 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600) [16:33:51] (and nothing but digging deeper on switch CLI or staring at switch syslog entries even tells you that the initial icinga alert was specifically about FPC5/PEM1) [16:35:15] I'm going to SWAT a MW patch if nobody objects [16:35:53] dcausse: no objection [16:36:52] no objection, but perhaps !log reopen the SWAT since it was already closed?
(though I’m not sure if that’s usually done, I just remember seeing it) [16:36:59] 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) a:03Papaul [16:37:50] !log reponing morning SWAT [16:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [16:38:31] 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) @Marostegui @jcrespo can you tell if it is 2TB SATA or SAS? IF it is 2TB SATA we have some new onces onsite. [16:39:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [16:40:50] !log execute reprepro clearvanished on install1002 to clear buster-wikimedia|thirdparty/amd-rocm (not used anymore) [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:32] 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10jcrespo) SAS HD disks of 1.819 TB. 
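(Editorial sketch, not part of the channel log.) The alarm walkthrough earlier in this log goes from the line "FPC 5 PEM 1 is not powered" to interface names (xe-5/x/y, ge-5/x/y) and then to a rack. A minimal Python sketch of that reasoning follows. The function names and the regular expression are hypothetical. The rack guess assumes the FPC-number-equals-rack-number convention described in the channel, which was explicitly said not to be universal, and assumes the row letter is the second dash-separated field of the stack name.

```python
import re
from typing import List, NamedTuple, Optional


class PemAlarm(NamedTuple):
    timestamp: str
    severity: str
    fpc: int
    pem: int


# Matches the single alarm line quoted in the log, e.g.
# "2019-07-17 16:11:01 UTC Major FPC 5 PEM 1 is not powered".
# Other alarm formats are deliberately not handled here.
ALARM_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC)\s+"
    r"(?P<sev>\w+)\s+"
    r"FPC (?P<fpc>\d+) PEM (?P<pem>\d+) is not powered$"
)


def parse_pem_alarm(line: str) -> Optional[PemAlarm]:
    """Return the parsed alarm, or None if the line has a different shape."""
    m = ALARM_RE.match(line.strip())
    if m is None:
        return None
    return PemAlarm(m["ts"], m["sev"], int(m["fpc"]), int(m["pem"]))


def interface_prefixes(fpc: int) -> List[str]:
    """Interface name prefixes for an FPC: the first number is the FPC."""
    return [f"xe-{fpc}/", f"ge-{fpc}/"]


def rack_guess(stack: str, fpc: int) -> str:
    """Guess the rack from a stack name such as 'asw2-b-eqiad'.

    Assumption: FPC number == rack number within the row. This is a local
    convention per the discussion, so treat the result as a hint only.
    """
    row = stack.split("-")[1].upper()
    return f"{row}{fpc}"


if __name__ == "__main__":
    alarm = parse_pem_alarm(
        "2019-07-17 16:11:01 UTC Major FPC 5 PEM 1 is not powered"
    )
    if alarm is not None:
        print(alarm, interface_prefixes(alarm.fpc),
              rack_guess("asw2-b-eqiad", alarm.fpc))
```

As noted in the channel, Netbox (the virtual chassis position of the device) is the place to confirm the rack, so the guessed value should be checked there before anyone goes to reseat a cable.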
[16:42:17] (03CR) 10Vgutierrez: [C: 03+2] Add domain root addrs for ncredir [dns] - 10https://gerrit.wikimedia.org/r/523924 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [16:44:23] (03CR) 10Vgutierrez: [C: 03+2] Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [16:44:53] Krenair: I actually wouldn't mind being added to deployment prep so I can verify that CL [16:45:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson) [16:45:58] 10Operations, 10ops-eqiad, 10procurement: Procurement Request for 3x 4tb SAS Drives for Helium-Array - https://phabricator.wikimedia.org/T228302 (10wiki_willy) [16:45:59] dcausse, I'm currently deploying [16:46:14] (sorry for not announcing) [16:46:20] Urbanecm: ok [16:46:27] (it's for T207094) [16:46:40] Urbanecm: I have a patch just merged on CirrusSearch for wmf14 [16:46:45] ack [16:47:44] !log gehel@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [16:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:20] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Services Operations): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo) [16:48:52] !log Deployed patch for T207094 [16:48:56] dcausse, I'm done [16:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:07] Urbanecm: thanks [16:49:08] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks for back history @akosiaris , we'll get the replacement drives ordered for you via procurement #T228302. 
~Willy [16:49:20] 10Operations, 10CX-cxserver, 10Citoid, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pchelolo) [16:52:30] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Services Operations): Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Pchelolo) [16:52:57] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team (Services Operations): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Pchelolo) [16:53:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I pretty much agree on not getting paged on mgmt NIC issues. +1, but I didn't test the patch in any way." [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris) [16:53:28] (03PS1) 10Andrew Bogott: Re-install cloudvirt1014 with Stretch and the 10g nic [puppet] - 10https://gerrit.wikimedia.org/r/523969 (https://phabricator.wikimedia.org/T226188) [16:54:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson) [16:54:22] !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CirrusSearch/includes/ElasticaErrorHandler.php: T228283: Log response data JSON on errors (duration: 00m 55s) [16:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:29] T228283: ElasticaErrorHandler.php: Object of class Elastica\Response could not be converted to string - https://phabricator.wikimedia.org/T228283 [16:55:15] (03CR) 10Andrew Bogott: [C: 03+2] Re-install cloudvirt1014 with Stretch and the 10g nic [puppet] - 10https://gerrit.wikimedia.org/r/523969 (https://phabricator.wikimedia.org/T226188) (owner: 10Andrew Bogott) [16:55:46] (03CR) 10Dzahn: [C: 03+1] "thanks! looks good to me. 
meant to remove paging for mgmt since a while" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris) [16:56:44] 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) a:05Papaul→03jcrespo Disk replaced [16:57:01] !log morning swat done [16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:12] 10Operations, 10DC-Ops, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrewbogott This server is ready for you, i updated raid cfg to R10 and 2 spare di... [16:57:50] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 4 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) [16:57:59] (03CR) 10Dzahn: [C: 03+1] "https://phabricator.wikimedia.org/T223458" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris) [16:58:40] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team (Services Operations): Requests to MW 404 when on HTTPS - https://phabricator.wikimedia.org/T202982 (10Pchelolo) [16:58:54] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Eevans) @Papaul you can take the server down as needed. 
[16:59:02] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/17440/icinga1001.wikimedia.org/ but duplicate contact groups are not hurting it" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris) [17:00:14] 10Operations, 10serviceops, 10Core Platform Team (Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Pchelolo) [17:00:22] it's a little after the deploy window, but it seems I need to roll back group1 because of https://phabricator.wikimedia.org/T228292 [17:01:57] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational [17:03:36] morning swat is over, nothing else on https://wikitech.wikimedia.org/wiki/Deployments for a bit, so going ahead with rollback [17:06:30] !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to 1.34.0-wmf.13" [17:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:52] (03CR) 10Dzahn: "hmm, i don't know this. 
adding herron" [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo) [17:08:45] (03PS2) 10Dzahn: trafficserver: add Icinga notes url for nrpe_monitor_script [puppet] - 10https://gerrit.wikimedia.org/r/521380 [17:09:08] !log shutting down restbase2009 for firmware upgrade [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:41] PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:59] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:13:15] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:13:37] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:13:37] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:14:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:11] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:13] RECOVERY - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:16:20] oh. thanks for logging that papaul, that explains it [17:16:29] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:14] PROBLEM - Host mw2181.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:18:45] that still doesn't quite explain it... in a perfect world that still shouldn't happen. I'll look into it [17:21:02] PROBLEM - Host mw2181 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:32] hmm. the mw host looks unexpected [17:21:39] is that right next to it? [17:21:48] looking at that one [17:22:28] mutante: mw2181 was logged already [17:22:31] https://phabricator.wikimedia.org/T205240 [17:23:03] mutante: doing firmware upgrade on mw2181 [17:23:09] papaul: gotcha! thanks [17:27:24] RECOVERY - Host mw2181 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [17:27:43] (PS1) CDanis: debian: release 1.1.1-1 [software/conftool] - https://gerrit.wikimedia.org/r/523972 [17:29:00] RECOVERY - Host mw2181.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.95 ms [17:32:31] (CR) Dzahn: [C: +2] trafficserver: add Icinga notes url for nrpe_monitor_script [puppet] - https://gerrit.wikimedia.org/r/521380 (owner: Dzahn) [17:34:33] (PS1) Lars Wirzenius: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/523973 [17:34:34] (CR) Lars Wirzenius: [C: +2] Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/523973 (owner: Lars Wirzenius) [17:34:44] (CR) CDanis: [C: +2] debian: release 1.1.1-1 [software/conftool] - https://gerrit.wikimedia.org/r/523972 (owner: CDanis) [17:35:33] (Merged) jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/523973 (owner: Lars Wirzenius) [17:36:10] RECOVERY - Host restbase2009 is
UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [17:36:46] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523973 (owner: 10Lars Wirzenius) [17:37:22] (03Merged) 10jenkins-bot: debian: release 1.1.1-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/523972 (owner: 10CDanis) [17:46:41] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo) [17:55:01] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) p:05High→03Normal [17:55:23] (03PS1) 10Kosta Harlan: Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) [17:56:51] (03CR) 10Revi: [C: 03+1] Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: 10Kosta Harlan) [17:57:00] (03CR) 10Catrope: [C: 03+2] Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: 10Kosta Harlan) [18:00:57] (03CR) 10Cwhite: [C: 03+1] varnishmtail: use -logs /dev/stdin instead of -logfds 0 [puppet] - 10https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: 10Jbond) [18:01:00] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include stretch-wikimedia conftool/conftool_1.1.1-1_amd64.changes [18:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:14] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include buster-wikimedia conftool/conftool_1.1.1-1+deb10u1_amd64.changes [18:01:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:25] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include jessie-wikimedia conftool/conftool_1.1.1-1+deb8u1_amd64.changes [18:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:18] !log upgrade to python3-conftool 1.1.1-1 on mwdebug2001 [18:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:09] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s mw-canary [18:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:04] (03PS5) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [18:07:27] (03CR) 10Volans: "LGTM, a couple of nits inline." (033 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [18:07:30] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:08:34] 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Papaul) a:05Papaul→03MoritzMuehlenhoff This was a very long progress upgrading the IDRAC since the server had 1.5 I couldn't upgrade to 2.6 had to upgrade first to 1.6 than to 2.6 Bef... 
[18:11:20] (03PS6) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [18:12:01] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:12:32] (03CR) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [18:12:42] 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) a:05MoritzMuehlenhoff→03None [18:14:10] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) After Firmware upgrade, we still have the Smart storage battery problem since the server is out of warranty we can not have the part replaced. [18:14:50] !log mw2181 - scap pull (T205240) [18:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:57] T205240: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 [18:15:42] !log mw2181 - sudo: /usr/local/bin/mwscript: command not found on scap pull ?? [18:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:47] !log testing conftool upgrade: cdanis@mw1261.eqiad.wmnet ~ % sudo -i depool [18:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:19] !log cdanis@mw1261.eqiad.wmnet ~ % sudo -i pool [18:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:16] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:23:27] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s eqsin [18:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:03] 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) Running 'scap pull' on this host (to sync mw code before repooling) fails with "sudo: /usr/local/bin/mwscript: command not found". [18:25:21] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s ulsfo [18:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:39] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s esams [18:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:08] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s codfw [18:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s eqiad [18:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:50] (03PS1) 10Andrew Bogott: cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188) [18:40:44] (03PS2) 10Andrew Bogott: cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188) [18:41:49] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188) (owner: 10Andrew Bogott) [18:49:48] PROBLEM - 
toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:50:36] (03PS1) 10Cwhite: gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 [18:50:46] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:51:17] (03CR) 10jerkins-bot: [V: 04-1] gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 (owner: 10Cwhite) [18:52:58] (03PS7) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [18:53:24] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:53:41] 10Operations, 10DC-Ops, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Andrew) 05Open→03Resolved [18:53:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [18:54:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [18:55:28] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.150 second response time 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:55:35] (03PS2) 10Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 [18:56:28] (03CR) 10jerkins-bot: [V: 04-1] proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 (owner: 10Cwhite) [18:59:17] (03PS1) 10CDanis: dbctl schemata: move files to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/523989 [18:59:42] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [19:01:26] (03CR) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [19:01:56] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10JAufrecht) [19:03:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/523989 (owner: 10CDanis) [19:04:11] (03CR) 10CDanis: [C: 03+2] dbctl schemata: move files to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/523989 (owner: 10CDanis) [19:04:22] (03CR) 10Jbond: "This change is ready for review." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523945 (owner: 10Cwhite) [19:04:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2181.codfw.wmnet [19:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:10] (03PS2) 10CDanis: conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 [19:06:12] (03PS12) 10CDanis: dbctl: monitor for uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/523013 [19:06:43] (03Merged) 10jenkins-bot: dbctl schemata: move files to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/523989 (owner: 10CDanis) [19:06:55] (03CR) 10jerkins-bot: [V: 04-1] conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (owner: 10CDanis) [19:08:15] 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) Made a separate task for the scap pull issue. Repooled the server anyways. [19:09:44] (03CR) 1020after4: [C: 03+1] Phab: Allow viewing ogg video files inline (instead of downloading) [puppet] - 10https://gerrit.wikimedia.org/r/523952 (https://phabricator.wikimedia.org/T228225) (owner: 10Aklapper) [19:10:58] 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) 05Open→03Resolved a:03Dzahn mcelog has not been written to since Oct 10 2018. No new thermal events after that. So not sure if that tells us much about the f... [19:11:10] (03PS3) 10CDanis: conftool: update schemata for dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126) [19:11:12] (03PS13) 10CDanis: dbctl: monitor for uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) [19:12:29] greg-g: You there? 
We would like to backport a LoadBalancer change to fix Wikidata dumps (https://phabricator.wikimedia.org/T228104) [19:15:37] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [19:16:22] (03PS3) 10Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 [19:18:34] hoo: ok, swat or whenever ready, note: wmf.14 is only on group0 right now [19:19:36] greg-g: Why (and until when) is Wikidata on group0? [19:19:57] until at least tomorrow [19:20:02] https://phabricator.wikimedia.org/T220739 [19:21:51] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523945 (owner: 10Cwhite) [19:24:46] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) This seems like a bad idea. Scratch is writable by all of cloud. I do not want that m... [19:25:06] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) We cross mount dumps NFS I believe to stats hosts (which might be production-ish), but... [19:25:13] 10Operations, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Jdforrester-WMF) We're seeing this happening now on contint... 
[19:27:17] (03PS1) 10Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523991 [19:27:19] (03PS1) 10Dzahn: static-rt: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523992 [19:27:21] (03PS1) 10Dzahn: tendril: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523993 [19:27:23] (03PS1) 10Dzahn: librenms: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523994 [19:27:25] (03PS1) 10Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523995 [19:28:28] (03CR) 10jerkins-bot: [V: 04-1] microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523991 (owner: 10Dzahn) [19:28:34] (03CR) 10jerkins-bot: [V: 04-1] static-rt: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523992 (owner: 10Dzahn) [19:28:38] (03CR) 10jerkins-bot: [V: 04-1] tendril: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523993 (owner: 10Dzahn) [19:29:16] (03CR) 10jerkins-bot: [V: 04-1] librenms: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523994 (owner: 10Dzahn) [19:29:30] (03PS2) 10Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523991 [19:29:32] (03CR) 10jerkins-bot: [V: 04-1] xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523995 (owner: 10Dzahn) [19:31:31] apergos, hoo: I've been poking at it on mwdebug1002 and it doesn't seem immediately and obviously broken, but… [19:31:49] oh, you've already scapped it out there? [19:32:07] Only onto mwdebug1002, not all of prod. 
[19:32:12] James_F: https://phabricator.wikimedia.org/T228104#5334937 [19:32:16] yes, mwdebug, exactly [19:32:19] You (or I) can try that to verify [19:32:20] if you want [19:32:47] wikidata will be running group1 == wmf.13 code, so it won't test that. [19:32:47] (PS8) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [19:33:16] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [19:33:29] we need something in group0, uh... [19:33:30] James_F: You can also test it with whatever wiki you like [19:33:35] testwiki or so [19:33:40] mediawikiwiki [19:33:44] yeah testwiki is fine too [19:33:56] It doesn't fatal. [19:34:08] 👍 [19:34:12] I'm more worried about random other crap that dies. [19:34:15] That's how it's supposed to be [19:34:24] I generally trust coders and reviewers to test the bug they're fixing. [19:34:45] I worry about the watchlist suddenly being blank, or editing a page causing a cache stampede, or… ;-D [19:34:50] Yeah, backporting LB changes is not exactly nice [19:35:16] Eh, it's only group0. [19:35:23] "only" :-D [19:35:33] Maybe we should go to wmf14 first and wait for 1 or 2 hours? [19:35:39] If MW.org breaks I'll notice sharp-ish.
[19:35:56] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer: T228104 rdbms: better handle a non-existing defaultGroup in LoadBalancer (duration: 00m 55s) [19:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:05] T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104 [19:36:08] I can be here for 1-2 hours but after that I will be a pumpkin (it's already 10:30 pm) [19:36:13] (03PS9) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [19:36:18] Yeah, I can push it later today if you want. [19:36:39] (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [19:36:44] Yeah, let's wait a bit and see whether the wmf14 part of the world ends [19:38:52] For wmf.13 I'm going to need to fiddle to cherry-pick, fun. [19:39:06] Doesn't it apply cleanly? [19:39:19] Oh, I suppose the tests might clash [19:39:22] 10Operations, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10thcipriani) For that particular image I can recreate locall... [19:40:07] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Urbanecm) >>! In T153068#5342835, @Bstorm wrote: > This seems like a bad idea. Scratch is writ... 
[19:40:27] (PS10) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [19:41:19] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [19:41:37] ugh, sorry about that [19:41:59] It's fine. :-) [19:42:25] Just that rdbms is one of the few areas I have marked out in DANGER! tape in my mind. :-) [19:45:45] Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Andrew) Can you explain in greater detail what problem you're trying to fix? I suspect that hi... [19:45:54] yeah, me too [19:51:47] (CR) Dzahn: "Thanks Jcrespo. I think the best way forward is that we just say what you said here, not used in production. The reason i want to add _an" [puppet] - https://gerrit.wikimedia.org/r/521382 (owner: Dzahn) [19:53:47] Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5342916, @Andrew wrote: > Can you explain in greater detail what probl... [20:00:04] cscott, arlolra, subbu, bearND, and halfak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T2000). [20:00:34] James_F: it did go out to wmf14 everywhere, right?
I don't see any scap/log anything in here [20:10:14] !log accraze@deploy1001 Started deploy [ores/deploy@676f7ba]: T228331 [20:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:21] T228331: Build revert model for glwiki - https://phabricator.wikimedia.org/T228331 [20:15:58] (03CR) 10Dzahn: [C: 03+2] proxysql: add icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/521382 (owner: 10Dzahn) [20:16:06] (03PS3) 10Dzahn: proxysql: add icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/521382 [20:23:21] 10Operations, 10ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (10Papaul) p:05High→03Normal [20:26:09] (03CR) 10SBassett: [C: 03+1] "Giving this a soft +1 on behalf of the WMF Security Team with the recommendation to review Daimona's suggesting about find_in_set above an" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [20:28:07] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) I really must decline this request if that's the reason. My thinking on this is: 1. T... [20:30:11] apergos: It did. [20:30:33] ok! I have been scrying logstash just in case [20:30:55] apergos: https://tools.wmflabs.org/sal/log/AWwBbyRrOwpQ-3PkId88 [20:31:17] but not in here. hmmm....bad bots get beaten! [20:31:45] oh. I see it in here now. apparently my reading abilities have taken a nosedive [20:31:51] (03PS3) 10Ottomata: Refine mediawiki_revision_create events using schema aware Refine job [puppet] - 10https://gerrit.wikimedia.org/r/523791 (https://phabricator.wikimedia.org/T211248) [20:31:55] sorry for the noise! 
[20:32:54] (03CR) 10Ottomata: [C: 03+2] Refine mediawiki_revision_create events using schema aware Refine job [puppet] - 10https://gerrit.wikimedia.org/r/523791 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [20:33:01] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10bd808) 05Open→03Declined We can not mount filesystems from the Cloud Services network realm... [20:33:13] (03PS4) 10Dzahn: proxysql: add icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/521382 [20:35:12] !log accraze@deploy1001 Finished deploy [ores/deploy@676f7ba]: T228331 (duration: 24m 59s) [20:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:20] T228331: Build revert model for glwiki - https://phabricator.wikimedia.org/T228331 [20:37:20] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Urbanecm) >>! In T153068#5343177, @bd808 wrote: > We can not mount filesystems from the Cloud S... [20:39:54] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10tstarling) Is this blocking deployment of PHP 7? [20:43:25] it's been an hour plus, and so far: no phab reports, no comments on mediawikiwiki itself (I'm stalking rc there), and nothing weird that I saw at any rate, in logstash [20:43:36] so, looking good so far hope-I-don't-jinx-it [20:44:18] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) >>! 
In T153068#5343192, @Urbanecm wrote: > That's in contrary with what @Bstorm said, b... [20:45:42] Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5343230, @Bstorm wrote: >>>! In T153068#5343192, @Urbanecm wrote: >> T... [20:46:35] Operations, ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (Papaul) No memory errors showing on this system in the log. Upgrade IDRAC from 1.5 to 2.6. We have a new BIOS version available; we need to depool the server for the upgrade [20:51:22] (PS1) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024 [20:54:44] Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway) [20:57:00] (CR) Eevans: "I am following https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm for the first time, and assuming that the deployment-charts re" [deployment-charts] - https://gerrit.wikimedia.org/r/524024 (owner: Eevans) [21:00:01] (PS1) 20after4: Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) [21:00:14] apergos, hoo: OK, things seem fine. I'll push it to wmf.13 too. [21:00:34] okey dokey, yeah they still look good from here [21:00:52] !log nuria@deploy1001 Started deploy [analytics/refinery@4f07755]: refinery 0.0.94 [21:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:11] (CR) Aklapper: [C: +1] Lock the phabricator authentication provider config options.
[puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4) [21:05:51] (CR) 20after4: [C: +1] Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4) [21:07:35] !log otto@deploy1001 Started deploy [eventstreams/deploy@dbc9bbb]: Fix ?doc to use openapi instead of swagger - T227958 [21:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:43] T227958: stream.wikimedia.org/?doc returns an error page - https://phabricator.wikimedia.org/T227958 [21:10:04] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (Andrew) Open→Resolved looks good -- thanks @colewhite [21:10:16] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:10:27] !log otto@deploy1001 Finished deploy [eventstreams/deploy@dbc9bbb]: Fix ?doc to use openapi instead of swagger - T227958 (duration: 02m 52s) [21:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:57] (CR) Paladox: [C: +1] Lock the phabricator authentication provider config options.
[puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4) [21:11:29] (PS4) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945 [21:15:08] (PS7) Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [21:15:47] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Flow: Clean up accidentally-deployed debugging code for T228290 (duration: 01m 02s) [21:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:53] T228290: Fatal on Watchlist: Nesting level too deep - https://phabricator.wikimedia.org/T228290 [21:16:42] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:16:52] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.13/includes/libs/rdbms/loadbalancer: T228104 rdbms: better handle a non-existing defaultGroup in LoadBalancer (duration: 00m 55s) [21:16:52] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:14] T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104 [21:17:19] (PS8) Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [21:17:52] (Abandoned) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945
(owner: Cwhite) [21:18:22] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:18:34] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:20:18] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:25:42] (PS1) Krinkle: mediawiki: Fix undefined 'err' and 'message' in php7-fatal-error [puppet] - https://gerrit.wikimedia.org/r/524036 (https://phabricator.wikimedia.org/T228345) [21:27:24] (PS1) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037 [21:31:32] (PS1) Bstorm: toolforge: Enable pod security policy [puppet] - https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290) [21:32:57] Puppet, cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (Andrew) Thinking about this a bit today, I'm no longer sure that the two puppet catalogs need to be disjoint. If...
[21:34:51] Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Fito) [21:36:14] (PS2) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037 [21:37:15] !log nuria@deploy1001 Finished deploy [analytics/refinery@4f07755]: refinery 0.0.94 (duration: 36m 28s) [21:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:34] !log deployment aborted for refinary 0.0.94 [21:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:05] (PS1) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 [21:40:56] (PS2) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 [21:42:03] (PS3) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 [21:42:24] !log started wikidata entity dumps json run on snapshot1008 [21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:30] (CR) jerkins-bot: [V: -1] nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 (owner: Dzahn) [21:43:39] (PS4) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 [21:44:53] (PS3) Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523991 [21:45:35] Operations, MediaWiki-Uploading, Multimedia, media-storage, Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (greg) Adding #operations per #media-storage / @fgiunchedi...
[21:45:54] (PS2) Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523995 [21:46:06] (PS2) Dzahn: librenms: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523994 [21:46:22] (PS2) Dzahn: tendril: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523993 [21:46:33] (PS2) Dzahn: static-rt: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523992 [21:47:42] (CR) Dzahn: nrpe: add notes_url parameter to spec and tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/521386 (owner: Dzahn) [21:47:56] RECOVERY - MegaRAID on es2003 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:50:15] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17449/" [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi) [21:51:51] (PS3) Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377 [21:52:59] (PS4) Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377 [21:56:02] (PS1) Ayounsi: Reserve IP for syslog anycast [dns] - https://gerrit.wikimedia.org/r/524045 [21:57:25] (CR) Dzahn: [C: +2] Lock the phabricator authentication provider config options.
[puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4) [21:59:06] (PS2) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024 [22:01:11] (PS3) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024 [22:01:44] (PS3) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037 [22:03:12] (PS4) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037 [22:04:04] (CR) Ayounsi: [C: +1] wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi) [22:16:21] !log Manually started the Wikidata RDF dumps on snapshot1008 (due to T228104) [22:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:29] T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104 [22:33:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:33:48] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (Dzahn) a:MoritzMuehlenhoff→Dzahn [22:35:09] !log reimaging mw2250 after disks have been replaced [22:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:46] (CR) Dzahn: [C: +2] postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377 (owner: Dzahn) [22:36:55] (PS5) Dzahn: postgresql: add icinga notes_url for postgres
repl lag [puppet] - https://gerrit.wikimedia.org/r/521377 [22:38:46] RECOVERY - Host mw2250 is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms [22:39:40] Operations, Developer-Advocacy, Discourse, Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (elappen-WMF) [22:41:15] (PS11) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) [22:42:08] (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [22:42:39] (CR) Ppchelko: Add change-prop event_service_uri and point at eventgate-main (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: Ottomata) [22:45:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:55:56] Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway) [23:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:06:49] (CR) Ppchelko: [C: +1] Add change-prop event_service_uri and point at eventgate-main (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: Ottomata) [23:14:44] (PS1) Ppchelko: Switch RESTBase evvnt production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T226522) [23:18:48] (PS2) Ppchelko: Switch RESTBase event production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T226522) [23:19:13] Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway) @Joe I've updated the fork at https://github.com/mdholloway/nsfwoid according to your... [23:29:02] (PS1) Catrope: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) [23:34:57] (PS1) Ppchelko: Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T226522) [23:35:52] (PS2) Ppchelko: [RESTRouter] Switch event service to eventgate.
[deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T226522) [23:37:42] (CR) Catrope: [C: +2] Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope) [23:38:46] (Merged) jenkins-bot: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope) [23:39:01] (CR) jenkins-bot: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope) [23:40:36] (PS3) Ppchelko: [RESTRouter] Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T524055) [23:41:02] (PS3) Ppchelko: Switch RESTBase event production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T524055) [23:48:05] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wmgUseTheWikipediaLibrary (false everywhere, no-op) (duration: 00m 53s) [23:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:21] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Add wmgUseTheWikipediaLibrary (false everywhere, no-op) (duration: 00m 54s) [23:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:46] (PS1) Catrope: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 [23:57:55] (CR) Catrope: [C: +2] beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 (owner: Catrope) [23:58:53] (Merged) jenkins-bot: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] -
https://gerrit.wikimedia.org/r/524061 (owner: Catrope) [23:59:14] (CR) jenkins-bot: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 (owner: Catrope)