[00:01:28] <mutante>	 !log wikitech-static changing certbot renewalparams: authenticator = webroot (changed from standalone), install = apache (unchanged) (T214640)
[00:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:36] <stashbot>	 T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640
[00:01:55] <mutante>	 !log wikitech-static certbot --dry-run renew (T214640)
[00:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:12] <mutante>	 !log wikitech-static - adding (undocumented!) option webroot-map to certbot config to use webroot authenticator with different document roots per domain while using the config file and not cli params (T214640)
[00:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:21] <stashbot>	 T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640
[00:24:31] <mutante>	 Urbanecm: rabbit hole.. webroot means having to specify doc roots (per domain), they are different for the 2 domains. not even documented how to do that in config file instead of with cli params. then found others asking it in their forums and there is "webroot-map" but if i use that like others marked it as solved.. it's "failed to parse config file". then found the NEW config syntax which 
[00:24:37] <mutante>	 is different.. then exception.. also nice is there is already a post hook but "No such file or directory: 'service apache2 start'
[00:25:32] <Urbanecm>	 wait, no such file or directory?
[00:25:49] <mutante>	 yea, but i mean.. if the authenticator if webroot.. we dont even want to stop and start it
[00:25:52] <mutante>	 is
[00:26:01] <Urbanecm>	 true
[00:26:02] <mutante>	 correct, "service ..." is "no such file"
[00:26:13] <Urbanecm>	 just reload at the end
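A minimal sketch of one plausible cause of the "No such file or directory: 'service apache2 start'" hook error discussed above: on POSIX, handing a whole command string to the OS as a single program name (instead of splitting it into arguments) produces exactly this kind of message. The helper names `run_hook` and `run_hook_unsplit` are hypothetical and this is not certbot's actual hook code; the Python interpreter stands in for the real command so the example can run anywhere.

```python
import shlex
import subprocess
import sys


def run_hook_unsplit(command: str) -> int:
    """Pass the whole string as the program name (POSIX, no shell).

    The OS then looks for an executable literally named e.g.
    'service apache2 start', which does not exist.
    """
    return subprocess.run(command, check=False).returncode


def run_hook(command: str) -> int:
    """Split the string into argv first, so only the first word is the program."""
    return subprocess.run(shlex.split(command), check=False).returncode


# Stand-in for a hook such as "service apache2 reload" (assumption: POSIX host).
hook = f"{shlex.quote(sys.executable)} -c pass"

try:
    run_hook_unsplit(hook)
    unsplit_error = None
except FileNotFoundError as exc:
    unsplit_error = exc  # message ends with the full, unsplit command string

split_returncode = run_hook(hook)
```

If this is indeed the cause, either splitting the hook string or running it through a shell would avoid the error; with the webroot authenticator a reload at the end should be all that is needed anyway.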
[00:26:45] <mutante>	 also fun is if the config file has "installer = apache" and if you dry-run you are told "installer = none"
[00:26:55] <Urbanecm>	 :)
[00:28:20] <mutante>	 FileNotFoundError: [Errno 2] No such file or directory: '/var/www/status/.well-known/acme-challenge/
[00:28:34] <Urbanecm>	 mutante, i use https://paste.ee/p/rLxLq on wmcz prod
[00:28:37] <mutante>	 wikitech-static.wikimedia.org.conf produced an unexpected error:
[00:28:42] <Urbanecm>	 not sure what's the one you used
[00:29:01] <mutante>	 ^ but i am telling it that the webroot for wikitech-static is NOT the same as the one for status.wm.org
[00:29:15] <Urbanecm>	 then just use different paths?
[00:29:32] <mutante>	 you have the new webroot_map syntax that i also use now
[00:30:26] <mutante>	 the old one was webroot-map = {"domain.com,www.domain.com":"/srv/www/customer/domain.com/www", "beta.domain.com":"/srv/www/customer/domain.com/beta"}
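A small sketch of how the old-style `webroot-map` value quoted above can be read: it is a JSON object whose keys may hold several comma-separated domains sharing one document root. The function name `expand_webroot_map` is hypothetical and the code only illustrates the data shape; it is not taken from certbot itself.

```python
import json
from typing import Dict


def expand_webroot_map(raw: str) -> Dict[str, str]:
    """Expand a JSON webroot map with comma-separated domain keys
    into one entry per domain."""
    per_domain: Dict[str, str] = {}
    for domains, webroot in json.loads(raw).items():
        for domain in domains.split(","):
            per_domain[domain.strip()] = webroot
    return per_domain


# The example value from the log line above.
old_style = (
    '{"domain.com,www.domain.com":"/srv/www/customer/domain.com/www", '
    '"beta.domain.com":"/srv/www/customer/domain.com/beta"}'
)
per_domain = expand_webroot_map(old_style)

for domain, webroot in sorted(per_domain.items()):
    print(f"{domain} -> {webroot}")
```

Seen this way, the two-domain case in this conversation is simply two entries, one per domain, each pointing at its own document root.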
[00:30:35] <Urbanecm>	 i see
[00:30:49] <Urbanecm>	 ad the filenotfounderror thing
[00:30:57] <Urbanecm>	 does /var/www/status exist?
[00:31:04] <mutante>	 yes, it does exist
[00:31:08] <Urbanecm>	 if so, does /var/www/status/.well-known/acme-challenge?
[00:31:14] <mutante>	 no, it does not
[00:31:32] <Urbanecm>	 could you try mkdir -p /var/www/status/.well-known/acme-challenge?
[00:31:33] <mutante>	 wait. it does NOW
[00:31:46] <Urbanecm>	 wait what?
[00:31:50] <mutante>	 it got created a few minutes ago
[00:31:52] <mutante>	 and it's empty
[00:31:58] <Urbanecm>	 what does the dry run do?
[00:32:18] <mutante>	 "Cert not due for renewal, but simulating renewal for dry run
[00:32:27] <mutante>	 "Cleaning up challenges
[00:32:43] <Urbanecm>	 doesn't look like anything bad so far
[00:34:04] <mutante>	 to be precise.. the no such file or directory is for a file INSIDE that acme-challenge dir
[00:34:13] <mutante>	 of course that is the challenge to find that
[00:34:30] <mutante>	 but it's not getting created
[00:35:28] <Urbanecm>	 well, what user does certbot run as?
[00:35:49] <Urbanecm>	 (if not root, does the user have write permissions to acme-challenge?)
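The permission question above can be checked directly. The sketch below, under the assumption that the webroot authenticator needs to write a token file into `<webroot>/.well-known/acme-challenge/`, creates that directory if missing and verifies that the current user can write, read, and remove a file there. The function name `check_challenge_dir` and the file name `probe-token` are hypothetical; a temporary directory stands in for a real document root such as `/var/www/status`.

```python
import tempfile
from pathlib import Path


def check_challenge_dir(webroot: str) -> Path:
    """Ensure the ACME challenge directory exists under webroot and is writable.

    Raises OSError (e.g. PermissionError) if the directory cannot be created
    or the probe file cannot be written.
    """
    challenge_dir = Path(webroot) / ".well-known" / "acme-challenge"
    challenge_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p

    probe = challenge_dir / "probe-token"  # hypothetical file name
    probe.write_text("probe")
    try:
        if probe.read_text() != "probe":
            raise OSError(f"could not read back {probe}")
    finally:
        probe.unlink()  # leave the directory empty again
    return challenge_dir


# Demonstration against a throwaway directory instead of a real webroot.
with tempfile.TemporaryDirectory() as fake_webroot:
    created = check_challenge_dir(fake_webroot)
    relative = created.relative_to(fake_webroot).as_posix()
    left_over = list(created.iterdir())
```

Run as the same user as the renewal job, a check like this separates a permissions problem from a wrong-webroot problem.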
[00:36:27] <Urbanecm>	 also not sure if --dry-run actually talks to the acme api
[00:36:58] <Urbanecm>	 if not, --force-renewal should enable you to renew before its due for renewal
[00:37:31] <mutante>	 it's root. the .well-known dir has just been created by dry run and it's root owned
[00:37:59] <mutante>	 as long as force doesnt mean i end up with the existing cert revoked and new ones not being issued :p
[00:38:26] <Urbanecm>	 god knows that
[00:39:05] <Urbanecm>	 it's not the definition of --force-renewal, but you know...
[00:39:15] <Urbanecm>	 ...things don't always do what the docs say they do :p
[00:39:39] <mutante>	 another attempt could be to have 2 certs for 2 domains with 2 config files
[00:40:05] <Urbanecm>	 that's exactly what wmcz does
[00:40:26] <mutante>	 this config here is doing one cert with an altname
[00:40:40] <mutante>	 and it's named after wikitech-static
[00:40:43] <Urbanecm>	 i see
[00:41:11] <Urbanecm>	 another solution is to just drop status.wm.o for good
[00:41:18] <Urbanecm>	 it doesn't do anything anymore iirc
[00:41:26] <mutante>	 lol, indeed. thought "who uses that page anyways" just now
[00:41:50] <mutante>	 and for not being used "status.wm.org down" sounds way too critical :)
[00:42:05] <Urbanecm>	 i see
[00:42:20] <Urbanecm>	 two configs sounds like better solution
[00:42:22] <Urbanecm>	 at least for now
[00:42:42] <Urbanecm>	 in next 100 years, i'd consider dropping status.wm.o :D
[00:43:15] <mutante>	 well.. kind of
[00:43:24] <mutante>	 the real fix should be to replace it with a new status page
[00:43:30] <mutante>	 that shows ..status
[00:43:48] <mutante>	 maybe that gets us back to "reopen icinga to the public" :p
[00:44:03] <Urbanecm>	 why was it closed btw?
[00:44:12] <mutante>	 some security issue years ago
[00:44:25] <mutante>	 not sure which one it was though
[00:44:39] <mutante>	 but it made us add simple auth back then
[00:44:49] <Urbanecm>	 i see
[00:44:56] <Urbanecm>	 icinga.wm.o looks old btw
[00:45:08] <mutante>	 s/old/stable/ :p
[00:45:42] <Urbanecm>	 that's the same, according to some project's policies
[00:45:44] <mutante>	 1.x is still in buster
[00:46:00] <mutante>	 so for now it's still ok
[00:46:05] <Urbanecm>	 ok
[00:46:53] <mutante>	 but yes, at one point it will be a question of using icinga 2.x or a completely different solution for alerting
[00:47:16] <Urbanecm>	 yeah
[00:47:31] <Urbanecm>	 back to the prev topic, did you try --force-renewal, or decided to keep that for later?
[00:48:02] <mutante>	 i decided to stop here and continue it in the morning because i feel tired and kind of rushed because the co-working space closes
[00:48:14] <mutante>	 not currently broken but potential to mess it up more 
[00:48:43] <Urbanecm>	 i see
[00:48:48] <mutante>	 doesnt expire until September
[00:49:04] <Urbanecm>	 great
[00:49:34] <Urbanecm>	 well, i'm about to go to bed then, i'm in eu
[00:50:07] <mutante>	 !log wikitech-static commented out cert renewal cron job out of caution - still needs fixing but continue tomorrow 
[00:50:13] <mutante>	 Urbanecm: thanks and good night then
[00:50:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:36] <Urbanecm>	 yw, always happy to help :)
[02:27:25] <icinga-wm>	 PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[02:58:39] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/includes/Permissions/PermissionManager.php: (no justification provided) (duration: 00m 57s)
[02:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:15] <icinga-wm>	 RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[03:00:37] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.13/includes/Permissions/PermissionManager.php: (no justification provided) (duration: 00m 54s)
[03:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:05] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[03:13:33] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[03:42:53] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.13/extensions/CentralAuth/includes/specials/SpecialMultiLock.php: T227772 (duration: 00m 56s)
[03:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:43:00] <stashbot>	 T227772: Fix or remove capability to override user rights for the current request - https://phabricator.wikimedia.org/T227772
[03:46:16] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CentralAuth/includes/specials/SpecialMultiLock.php: T227772 (duration: 00m 54s)
[03:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:01] <wikibugs>	 (PS1) Marostegui: db1065: Prepare decommissioning [puppet] - https://gerrit.wikimedia.org/r/523849 (https://phabricator.wikimedia.org/T227560)
[05:23:48] <wikibugs>	 (CR) Marostegui: [C: +2] db1065: Prepare decommissioning [puppet] - https://gerrit.wikimedia.org/r/523849 (https://phabricator.wikimedia.org/T227560) (owner: Marostegui)
[05:24:56] <marostegui>	 !log Remove db1065 from tendril and zarcillo - T227560
[05:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:05] <stashbot>	 T227560: decommission db1065 - https://phabricator.wikimedia.org/T227560
[05:26:34] <marostegui>	 !log Stop MySQL on db1065 for decommissioning - T227560
[05:26:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:12] <wikibugs>	 Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:35:14] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Papaul)
[05:39:11] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[05:43:20] <wikibugs>	 Operations, Traffic: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (Vgutierrez) Implement logging of SSL Elliptic Curve used: https://github.com/apache/trafficserver/pull/5724 has been already merged into master. The API...
[05:43:39] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[05:58:42] <wikibugs>	 Operations, ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (elukey) @wiki_willy I'll try to disable this alarm for good, the host does not use the disk and there is no real reason to waste a spare :)
[05:59:20] <wikibugs>	 Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui)
[06:01:03] <wikibugs>	 Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui) p:Triage→Normal
[06:20:42] <elukey>	 !log sudo -i /usr/local/sbin/restart-php7.2-fpm on mwdebug* to reset opcache
[06:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:12] <elukey>	 !log reboot analytics1072 as attempt to clear the megacli's config (and add a new disk)
[06:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:54] <icinga-wm>	 PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:30:36] <icinga-wm>	 PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:39:12] <icinga-wm>	 PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:39:13] <wikibugs>	 (PS1) Elukey: Remove host specific hiera settings for analytics1072 [puppet] - https://gerrit.wikimedia.org/r/523851 (https://phabricator.wikimedia.org/T226467)
[06:40:13] <wikibugs>	 (CR) Elukey: [C: +2] Remove host specific hiera settings for analytics1072 [puppet] - https://gerrit.wikimedia.org/r/523851 (https://phabricator.wikimedia.org/T226467) (owner: Elukey)
[06:43:40] <wikibugs>	 Operations, DBA, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (Marostegui) Great work, a lot less files to edit when provisioning/moving/decommissioning hosts which were very error prone!  Thanks :)
[06:43:48] <wikibugs>	 Operations, MediaWiki-Debug-Logger, Release-Engineering-Team-TODO, Wikimedia-Logstash: Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455 (greg)
[06:44:26] <wikibugs>	 Operations, ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (wiki_willy) Thanks @elukey , much appreciated!   ~Willy
[06:44:34] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (elukey) Open→Resolved @Cmjohnson thanks a lot! I had to reboot again to be able to configure the new PD, not really sure why (the megacli commands were failing bef...
[06:46:25] <wikibugs>	 Puppet, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), User-greg: Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607 (greg) Open→Resolved a:greg >>! In T143607#3413032, @EBernhardson wrote: > mwrepl has a 'bypa...
[06:50:28] <wikibugs>	 (PS1) Muehlenhoff: Switch ORES pool counters for eqiad to 1003/1004 [puppet] - https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640)
[06:51:26] <wikibugs>	 (PS1) Elukey: Add mw2224 to the list of hosts with async replication in mcrouter [puppet] - https://gerrit.wikimedia.org/r/523855 (https://phabricator.wikimedia.org/T225642)
[06:55:30] <icinga-wm>	 RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:57:44] <icinga-wm>	 RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:58:19] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17423/" [puppet] - https://gerrit.wikimedia.org/r/523855 (https://phabricator.wikimedia.org/T225642) (owner: Elukey)
[06:59:04] <elukey>	 !log apply mcrouter async replication to mw2224 - T225642
[06:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:19] <stashbot>	 T225642: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642
[06:59:48] <wikibugs>	 Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (Legoktm) During the initial PHP 7 preparation (when that puppet file was written), I did an...
[07:00:22] <icinga-wm>	 RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[07:01:28] <wikibugs>	 Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar), User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (elukey) @aaron mw2224 ready for testing :)
[07:02:02] <wikibugs>	 Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (MoritzMuehlenhoff) On the Debian packaging level there are also no reverse depencies on php-...
[07:09:57] <wikibugs>	 (PS1) Muehlenhoff: Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857
[07:13:09] <wikibugs>	 Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (MoritzMuehlenhoff) Graphoid is based on NodeJS, so it should be migrated to Node 10 (and thus Stretch) ei...
[07:13:18] <wikibugs>	 (PS2) Muehlenhoff: Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857
[07:15:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch sarin to Buster [puppet] - https://gerrit.wikimedia.org/r/523857 (owner: Muehlenhoff)
[07:26:59] <wikibugs>	 (PS3) Ema: 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668)
[07:33:38] <moritzm>	 !log reimaging sarin for some tests
[07:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:40] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:45:17] <wikibugs>	 (PS2) Jcrespo: mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939)
[07:45:32] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[07:45:37] <wikibugs>	 (CR) Jcrespo: "Please review and confirm." [puppet] - https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: Jcrespo)
[07:46:19] <ema>	 !log cp-esams: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672
[07:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:26] <stashbot>	 T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672
[07:48:22] <godog>	 !log swift eqiad-prod: put back ms-be1043 sdk1 - T218544
[07:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:29] <stashbot>	 T218544: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544
[07:50:28] <wikibugs>	 Operations, ops-eqiad, Operations-Software-Development, observability, User-fgiunchedi: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (fgiunchedi)
[07:51:50] <wikibugs>	 (PS1) DCausse: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136)
[07:54:37] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) @Marostegui Double checking, should we replace this or is it being decommed now?
[07:56:22] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Marostegui) Let's replace with an USED one for now, that host will go away "soonish"
[07:59:54] <wikibugs>	 (PS2) Filippo Giunchedi: Remove lithium from service [puppet] - https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706)
[08:00:55] <wikibugs>	 Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[08:01:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Remove lithium from service [puppet] - https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi)
[08:02:24] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (MoritzMuehlenhoff) Ack, this looks good to me!
[08:02:54] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) Also followed up on the codfw task, but adding here for completeness as well: This looks good to me!
[08:03:21] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CRITICAL - load average: 186.88, 119.76, 55.91 https://wikitech.wikimedia.org/wiki/Swift
[08:03:35] <icinga-wm>	 PROBLEM - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:03:36] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228245 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:03:40] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (ops-monitoring-bot)
[08:05:19] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (elukey) a:elukey→RobH
[08:05:28] <wikibugs>	 Operations, serviceops, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) After discussing with @Pchelolo, we believe that in order to migrate the rest, we could migrate ~25% of job...
[08:05:43] <wikibugs>	 Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) a:elukey→RobH
[08:06:25] <icinga-wm>	 PROBLEM - Disk space on ms-be2019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[08:07:00] <godog>	 I'll take a look at 2019 shortly
[08:08:36] <wikibugs>	 (PS4) Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts [puppet] - https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145)
[08:09:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: add kafka logging consumer lag alerts [puppet] - https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) (owner: Filippo Giunchedi)
[08:10:19] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 17.16, 65.37, 56.68 https://wikitech.wikimedia.org/wiki/Swift
[08:10:38] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be2019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T228246 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:10:41] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228246 (ops-monitoring-bot)
[08:12:15] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:15:30] <wikibugs>	 (PS2) Effie Mouzeli: jobrunners: Test php7_only on 6 jobrunners [puppet] - https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148)
[08:16:50] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) There is no spare USED disks.
[08:16:59] <icinga-wm>	 PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[08:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:49] <stashbot>	 T227867: mw1239 memory errors  - https://phabricator.wikimedia.org/T227867
[08:20:56] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, I did the same test on cp2026 and seems to work as expected." [puppet] - https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: Jbond)
[08:21:41] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[08:24:55] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1003. [puppet] - https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122)
[08:24:57] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122)
[08:24:59] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2005. [puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122)
[08:25:01] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122)
[08:25:03] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1007. [puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122)
[08:25:05] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1008. [puppet] - https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122)
[08:25:07] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2002. [puppet] - https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122)
[08:25:09] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs2003. [puppet] - https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122)
[08:25:11] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1005. [puppet] - https://gerrit.wikimedia.org/r/523874 (https://phabricator.wikimedia.org/T228122)
[08:25:13] <wikibugs>	 (PS1) Gehel: wdqs: introduced tuned journal options to wdqs1006. [puppet] - https://gerrit.wikimedia.org/r/523875 (https://phabricator.wikimedia.org/T228122)
[08:27:09] <wikibugs>	 (CR) Filippo Giunchedi: prometheus: wire up prometheus-varnishkafka-exporter for deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:27:58] <wikibugs>	 (PS4) Filippo Giunchedi: Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703)
[08:28:44] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1003. [puppet] - https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:29:05] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:29:24] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2005. [puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:29:46] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2006. [puppet] - https://gerrit.wikimedia.org/r/523869 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:29:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Add rsyslog delivery actions failure alerts [puppet] - https://gerrit.wikimedia.org/r/520709 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[08:30:02] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1007. [puppet] - https://gerrit.wikimedia.org/r/523870 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:30:25] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1008. [puppet] - https://gerrit.wikimedia.org/r/523871 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:30:51] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2002. [puppet] - https://gerrit.wikimedia.org/r/523872 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:31:32] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1005. [puppet] - https://gerrit.wikimedia.org/r/523874 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:31:55] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs1006. [puppet] - https://gerrit.wikimedia.org/r/523875 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:32:54] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: introduced tuned journal options to wdqs2003. [puppet] - https://gerrit.wikimedia.org/r/523873 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[08:34:23] <wikibugs>	 (PS1) Muehlenhoff: maps: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/523876
[08:34:32] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/523624 (owner: Muehlenhoff)
[08:34:51] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: DCausse)
[08:35:29] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:36:14] <wikibugs>	 (03PS7) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668)
[08:36:37] <jijiki>	 !log Disable puppet on thumbor* in eqiad, depool and pool back to apply 523728 - T224572
[08:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:45] <stashbot>	 T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572
[08:38:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Switch Thumbor pool counters in eqiad to poolcounter1004 [puppet] - 10https://gerrit.wikimedia.org/r/523728 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff)
[08:38:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: Switch Thumbor pool counters in eqiad to poolcounter1004 [puppet] - 10https://gerrit.wikimedia.org/r/523728 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff)
[08:38:46] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10Marostegui) We should have a bunch of disks from the decommissioned hosts, no?
[08:39:28] <wikibugs>	 10Operations, 10DBA, 10Phabricator, 10User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (10Marostegui) Window reserved on the deployments page: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1832674&oldid=1832612 Em...
[08:40:31] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:46:22] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10TheDJ) Note that getimagesize and getimagesizefromstring are [[ https://github.com/php/php-s...
[08:47:32] <wikibugs>	 (03PS1) 10Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548)
[08:47:34] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Enable paging for ncredir checks [puppet] - 10https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548)
[08:51:20] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:51:22] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:45] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:02:41] <wikibugs>	 (03CR) 10Ema: [C: 03+2] 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[09:07:43] <ema>	 !log upload fifo-log-demux 0.3 to stretch-wikimedia T227668
[09:07:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:52] <stashbot>	 T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668
[09:09:04] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (10Peachey88)
[09:09:06] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228246 (10Peachey88)
[09:11:09] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862)
[09:11:59] <wikibugs>	 (03PS2) 10Gehel: wdqs: introduced tuned journal options to wdqs1003. [puppet] - 10https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122)
[09:13:24] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wdqs: introduced tuned journal options to wdqs1003. [puppet] - 10https://gerrit.wikimedia.org/r/523866 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel)
[09:13:38] <wikibugs>	 (03PS1) 10Ema: ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668)
[09:14:34] <wikibugs>	 (03PS2) 10Ema: ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668)
[09:15:05] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[09:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:05] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez)
[09:16:05] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[09:16:13] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548)
[09:16:17] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui) p:Triage→Normal
[09:16:46] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[09:16:51] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10Marostegui)
[09:17:08] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: pass -socket and -regexp to fifo-log-tailer [puppet] - 10https://gerrit.wikimedia.org/r/523881 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[09:17:43] <vgutierrez>	 damn... I got puppet sniped xD
[09:17:45] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[09:17:47] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[09:17:57] <wikibugs>	 (03PS3) 10Vgutierrez: ncredir: Set notes_url for https_ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523877 (https://phabricator.wikimedia.org/T133548)
[09:18:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui)
[09:19:15] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:19:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:53] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui)
[09:20:10] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Clarify db2045 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523880 (https://phabricator.wikimedia.org/T227862) (owner: 10Marostegui)
[09:21:15] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool and clarify db2045 status T227862 (duration: 00m 55s)
[09:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:22] <stashbot>	 T227862: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862
[09:21:43] <ema>	 !log cp-ats: upgrade fifo-log-demux to 0.3 T227668
[09:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:49] <stashbot>	 T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668
[09:22:06] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10Marostegui) No point on spending time with this old host, I will start its decommissioning process.
[09:22:26] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: Enable paging for ncredir checks [puppet] - 10https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez)
[09:22:35] <wikibugs>	 (03PS2) 10Vgutierrez: lvs: Enable paging for ncredir checks [puppet] - 10https://gerrit.wikimedia.org/r/523878 (https://phabricator.wikimedia.org/T133548)
[09:23:44] <moritzm>	 !log rebooting grafana1001 to pick up MDS-enabled qemu
[09:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] <wikibugs>	 (03PS8) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668)
[09:25:29] <wikibugs>	 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Reedy) Mounting it where though?
[09:28:50] <wikibugs>	 (CR) Filippo Giunchedi: prometheus: wire up prometheus-varnishkafka-exporter for deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:33:26] <logmsgbot>	 !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[09:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:57] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[09:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:59] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi)
[09:39:20] <wikibugs>	 (CR) Jbond: [C: -1] "The previous nits can be ignored as this is not going to be around long.  however there is a bug in the change to lookup vs hiera" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:39:54] <wikibugs>	 (03PS4) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff)
[09:40:06] <wikibugs>	 (03PS5) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff)
[09:40:42] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "Tried with mtail 3.0.0~rc5-1~bpo9+1wmf1 and confirmed that stats do get incremented as expected." [puppet] - 10https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: 10Jbond)
[09:40:50] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff)
[09:41:12] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff T223450 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[09:43:00] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi)
[09:47:03] <wikibugs>	 (03PS3) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668)
[09:47:10] <wikibugs>	 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster1003.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201907170947_j...
[09:49:29] <moritzm>	 vgutierrez: puppet runs are failing on icinga1001, seems to be caused by your set_notes comment for ncredir:
[09:49:30] <moritzm>	 Error while evaluating a Function Call, The $dashboard_links and $notes_links URLs must not be URL-encoded at /etc/puppet/modules/monitoring/functions/build_notes_url.pp:18:13 at /etc/puppet/modules/profile/manifests/prometheus/alerts.pp:194 on node icinga1001.wikimedia.org
[09:49:44] <vgutierrez>	 uh?
[09:49:45] <moritzm>	 commit, not comment
[09:50:30] <vgutierrez>	 the notes_url is 'https://wikitech.wikimedia.org/wiki/Ncredir'
[09:50:38] <vgutierrez>	 how's that URL encoded?
[09:50:43] <wikibugs>	 (03PS9) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668)
[09:51:24] <moritzm>	 I have no idea, only saw the alert in our Icinga :-)
[09:51:43] <vgutierrez>	 yeah, thanks for pinging me
[09:51:54] <vgutierrez>	 but I dunno what's going on here TBH
[09:52:28] <vgutierrez>	 hmmm from build_notes_url.pp
[09:52:31] <vgutierrez>	  # The notes link always has to come first to ensure the correct icon is used in icinga
[09:52:31] <vgutierrez>	     # we start with `[]` so puppet knows we want a array
[09:52:31] <vgutierrez>	     $links = [] + $notes_link + $dashboard_links
[09:52:35] <vgutierrez>	 fixing....
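The Puppet error quoted above rejects URL-encoded notes and dashboard links, and the quoted lines from build_notes_url.pp put the notes link first. The log does not show how the real function detects encoding, so the following is only a rough Python sketch under that assumption: a URL is treated as encoded if decoding percent-escapes changes it. The names `is_url_encoded` and `build_notes_url` here are illustrative, not the Puppet implementation.

```python
from __future__ import annotations

from urllib.parse import unquote


def is_url_encoded(url: str) -> bool:
    """Return True if decoding percent-escapes would change the URL.

    Assumption: this is one plausible test; the actual Puppet check
    is not shown in the log.
    """
    return unquote(url) != url


def build_notes_url(notes_link: str, dashboard_links: list[str]) -> list[str]:
    """Hypothetical analogue: notes link first, then dashboard links."""
    links = [notes_link] + list(dashboard_links)
    for link in links:
        if is_url_encoded(link):
            raise ValueError(f"URL must not be URL-encoded: {link}")
    return links


if __name__ == "__main__":
    # The ncredir notes URL from the conversation contains no percent-escapes.
    print(is_url_encoded("https://wikitech.wikimedia.org/wiki/Ncredir"))
    # A link containing %23 (an encoded '#') would be rejected.
    print(is_url_encoded("https://wikitech.wikimedia.org/wiki/Netbox%23Reports"))
```

Under this test the ncredir URL passes, which is consistent with the later finding in the channel that the offending link came from elsewhere.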
[09:53:36] <wikibugs>	 (03PS1) 10Vgutierrez: ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548)
[09:53:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez)
[09:54:01] <vgutierrez>	 wonderful
[09:54:14] <wikibugs>	 10Operations, 10DBA, 10Jade, 10Patch-For-Review, and 2 others: Review Jade data storage and architecture proposal [RFC] - https://phabricator.wikimedia.org/T200297 (10awight) Congratulations, looking forward to seeing this deployed!
[09:54:23] <vgutierrez>	 ah.. rebasing issues :)
[09:54:24] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548)
[09:55:51] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ncredir: Fix notes_url [puppet] - 10https://gerrit.wikimedia.org/r/523888 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez)
[09:58:07] <wikibugs>	 (03PS6) 10Jcrespo: Remove sarin/neodymium from grant/mysql root hosts [puppet] - 10https://gerrit.wikimedia.org/r/466833 (owner: 10Muehlenhoff)
[09:59:59] <vgutierrez>	 nope.. that wasn't the issue :/
[10:00:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: varnish: remove varnishreqstats-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/523891 (https://phabricator.wikimedia.org/T184942)
[10:00:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: varnish: ensure varnishreqstats is absent [puppet] - 10https://gerrit.wikimedia.org/r/523892 (https://phabricator.wikimedia.org/T184942)
[10:00:44] <jbond42>	 vgutierrez: I'll take a look at this, I'm familiar with the notes_url stuff and think it is unrelated to your change
[10:00:50] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "ncredir: Fix notes_url" [puppet] - 10https://gerrit.wikimedia.org/r/523893
[10:01:06] <vgutierrez>	 jbond42: yeah, it looks right on the first one
[10:01:11] <vgutierrez>	 I'm reverting my last commit
[10:01:16] <ema>	 jbond42: thanks for figuring out the -logs /dev/stdin thing! <3
[10:01:43] <jbond42>	 ema: np, was suggested by the upstream dev
[10:01:44] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "ncredir: Fix notes_url" [puppet] - 10https://gerrit.wikimedia.org/r/523893 (owner: 10Vgutierrez)
[10:01:56] <jbond42>	 vgutierrez: yes the first one is fine
[10:02:33] <vgutierrez>	 all yours then :)
[10:03:13] <jbond42>	 thanks :)
[10:04:17] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:04:18] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:08:19] <moritzm>	 !log rebooting lithium for kernel update
[10:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:57] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:18:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:18:30] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:14] <wikibugs>	 (03PS4) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668)
[10:19:21] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:19:38] <wikibugs>	 (03PS10) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668)
[10:19:52] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:20:51] <moritzm>	 !log disabled icinga1001 in meta monitoring
[10:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:14] <wikibugs>	 (03PS1) 10Jbond: Icinga: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897
[10:21:50] <vgutierrez>	 jbond42: this one looks like the offender BTW: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/alerts.pp#L202
[10:22:07] <jbond42>	 vgutierrez: lol see the patch i just sent above
[10:22:07] <wikibugs>	 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1003.eqiad.wmnet'] `  and were **ALL** successful.
[10:22:14] <vgutierrez>	 ahaha right
[10:22:22] <vgutierrez>	 it should say prometheus: in the commit message right?
[10:22:31] <vgutierrez>	 I mean, it's a change on the prometheus profile
[10:22:44] <wikibugs>	 (03PS2) 10Jbond: Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897
[10:22:47] <jbond42>	 yep fixed
[10:22:52] <vgutierrez>	 <3 thx
[10:23:08] <wikibugs>	 (03PS3) 10Jbond: Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897
[10:23:17] <moritzm>	 !log rebooting icinga1001 for kernel update
[10:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Icinga - prometheus::alert: ensure dashboard links are not url encoded [puppet] - 10https://gerrit.wikimedia.org/r/523897 (owner: 10Jbond)
[10:30:45] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[10:30:49] <godog>	 !log start rolling reboot of ms-be eqiad hosts - T225713
[10:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:02] <stashbot>	 T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[10:33:03] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 4383 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:34:13] <icinga-wm>	 RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[10:36:53] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Switch ORES pool counters for eqiad to 1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff)
[10:36:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch ORES pool counters for eqiad to 1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff)
[10:37:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, thanks. Merging!" [puppet] - 10https://gerrit.wikimedia.org/r/523854 (https://phabricator.wikimedia.org/T227640) (owner: 10Muehlenhoff)
[10:41:14] <godog>	 !log install updated linux-image-4.9.0-9-amd64 on ms-be hosts
[10:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:09] <wikibugs>	 (03PS1) 10Ema: prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668)
[10:43:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:45:56] <wikibugs>	 (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:46:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:47:40] <wikibugs>	 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (10fgiunchedi)
[10:49:09] <wikibugs>	 (03PS2) 10Ema: prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668)
[10:53:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fetch ATS origin server metrics [puppet] - 10https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema)
[10:53:49] <moritzm>	 !log re-enabled icinga1001 in meta monitoring
[10:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:07] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[10:55:49] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Packaging: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (jbond) p:Triage→Normal
[10:59:39] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1189 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
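The two wdqs1003 lag alerts above report a value against a warning threshold of 1200 and a critical threshold of 3600 ("4383 ge 3600" is CRITICAL, "3600 ge 1200 ge 1189" is OK). A minimal Python sketch of that comparison, assuming the check uses greater-or-equal as the "ge" in the messages suggests; `classify_lag` is a hypothetical helper, not the actual Icinga check:

```python
def classify_lag(lag_seconds: float, warning: float = 1200, critical: float = 3600) -> str:
    """Classify an update lag value against warning/critical thresholds.

    Thresholds default to the values shown in the alert text; the use of
    >= mirrors the "ge" wording and is otherwise an assumption.
    """
    if lag_seconds >= critical:
        return "CRITICAL"
    if lag_seconds >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for value in (4383, 1189):
        print(value, classify_lag(value))
```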
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1100).
[11:00:04] <jouncebot>	 matthiasmullie and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:23] <dcausse>	 o/
[11:00:45] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond) during my research i noticed that puppet db was failing with the following error  ` Compiling catalog for achernar.wikimed...
[11:00:48] <matthiasmullie>	 o/
[11:00:59] <matthiasmullie>	 dcausse: mine will take a little longer
[11:01:11] <matthiasmullie>	 let's do yours first
[11:01:25] <icinga-wm>	 RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[11:01:27] <dcausse>	 matthiasmullie: sure thanks
[11:01:27] <matthiasmullie>	 want to deploy yourself, or want me to do it?
[11:01:39] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[11:01:46] <wikibugs>	 (03PS2) 10Jbond: puppet_compiler: Add checks for missing facts files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/523709 (https://phabricator.wikimedia.org/T228266)
[11:01:49] <dcausse>	 matthiasmullie: I can deploy
[11:02:07] <matthiasmullie>	 okay
[11:02:18] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse)
[11:03:17] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse)
[11:03:32] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] switch search traffic (except completion) to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523860 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse)
[11:05:16] <wikibugs>	 (03PS1) 10Vgutierrez: Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548)
[11:06:17] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond)
[11:08:29] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T227136: [cirrus] switch search traffic (except completion) to codfw (duration: 00m 54s)
[11:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] <stashbot>	 T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136
[11:10:03] <icinga-wm>	 RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[11:10:34] <Daimona>	 Urbanecm Hello :-) Please ping me when you're ready & deployments are over, thanks
[11:11:21] <dcausse>	 matthiasmullie: I'm done
[11:11:48] <matthiasmullie>	 k, thanks
[11:12:16] <matthiasmullie>	 Daimona: I'll let you know once I'm done, but will need a full scap, will take some time
[11:12:42] <Daimona>	 Yay, no hurry, thanks
[11:15:55] <hauskatze>	 ugh, full scap :-)
[11:16:01] <hauskatze>	 slooooooow
[11:16:02] <wikibugs>	 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#5339132, @greg wrote: >>>! In T211881#5332195, @akosiaris wrote: >> the hardwar...
[11:16:24] <dcausse>	 !log reindexing wikidata (elastic@eqiad) T227136
[11:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:31] <stashbot>	 T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136
[11:22:49] <icinga-wm>	 PROBLEM - Keyholder SSH agent on icinga1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[11:22:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[11:23:15] <wikibugs>	 Operations, Puppet, Packaging, Patch-For-Review: Hiera incompatible with newer versions of puppet - https://phabricator.wikimedia.org/T227779 (jbond) Open→Resolved
[11:23:21] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond)
[11:25:48] <Urbanecm>	 matthiasmullie, why does https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/523879 need a full scap?
[11:26:09] <icinga-wm>	 RECOVERY - Keyholder SSH agent on icinga1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[11:26:39] <Urbanecm>	 scap sync-file /srv/mediawiki-staging/php-1.34.0-wmf.14/extensions/WikibaseMediaInfo should be enough
[11:26:46] <Urbanecm>	 unless I'm overlooking something
[11:27:02] <matthiasmullie>	 hrm
[11:27:12] <matthiasmullie>	 for some reason, I thought it didn't take directories
[11:27:13] <matthiasmullie>	 you're right
[11:27:13] <matthiasmullie>	 and just in time, was about to scap :p
[11:27:40] <Urbanecm>	 full scap is required for a) i18n/namespace changes b) new directories added
[11:29:44] <matthiasmullie>	 I suppose b) got me confused over directories :p
[11:29:47] <matthiasmullie>	 TIL!
[11:29:48] <matthiasmullie>	 thanks
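The rule of thumb Urbanecm gives above (a full scap is needed for i18n/namespace changes or when new directories are added, otherwise a sync of the file or directory is enough) can be written down as a small decision helper. This is only an illustrative Python sketch; the `Change` type and its field names are hypothetical and not part of scap.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical description of a deployment change."""
    touches_i18n_or_namespaces: bool = False
    adds_new_directories: bool = False


def needs_full_scap(change: Change) -> bool:
    """Apply the two conditions stated in the conversation."""
    return change.touches_i18n_or_namespaces or change.adds_new_directories


if __name__ == "__main__":
    # A plain revert inside an existing extension directory: sync is enough.
    print(needs_full_scap(Change()))
    # A change that adds i18n messages would need the full scap.
    print(needs_full_scap(Change(touches_i18n_or_namespaces=True)))
```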
[11:30:04] <matthiasmullie>	 syncing now
[11:30:04] <logmsgbot>	 !log mlitn@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/WikibaseMediaInfo: [WikibaseMediaInfo] Revert "Add Wikidata links to statement UI elements" (duration: 00m 56s)
[11:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:52] <matthiasmullie>	 Daimona: I'm done
[11:30:56] <wikibugs>	 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Urbanecm) >>! In T153068#5340418, @Reedy wrote: > Mounting it where though?  Active maintenance...
[11:31:35] <Daimona>	 Thanks
[11:31:47] <Daimona>	 I still need 10 minutes then I'm ready to start
[11:32:23] <Urbanecm>	 Daimona, I'll be back in about 20 mins, will ping you
[11:35:27] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: Effie Mouzeli)
[11:36:14] <wikibugs>	 (03PS1) 10Jbond: puppetmaster1003: convert puppetmaster1003 from spare top puppetmaster::backend [puppet] - 10https://gerrit.wikimedia.org/r/523907 (https://phabricator.wikimedia.org/T201342)
[11:37:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster1003: convert puppetmaster1003 from spare top puppetmaster::backend [puppet] - 10https://gerrit.wikimedia.org/r/523907 (https://phabricator.wikimedia.org/T201342) (owner: 10Jbond)
[11:38:33] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) it seems that  container synchronization is broken a...
[11:40:15] <hauskatze>	 tgr: thanks re. MassMessage :)
[11:40:38] <hauskatze>	 slowly learning in my (few) free time
[11:45:21] <Urbanecm>	 Daimona, I'm back
[11:45:29] <Daimona>	 Ready
[11:45:35] <Urbanecm>	 cool!
[11:45:38] <Daimona>	 So we can start?
[11:45:40] <Urbanecm>	 Sure
[11:45:56] <Daimona>	 Alright!
[11:46:04] <Daimona>	 So, first of all I'd like to see another dry run
[11:46:20] <Urbanecm>	 sure
[11:46:27] <Daimona>	 Since there've been some on-wiki changes
[11:46:59] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[confd],Group[gitpuppet] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[11:47:26] <Urbanecm>	 Running foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php --dry-run
[11:47:31] <Daimona>	 Thanks
[11:47:39] <Daimona>	 Are you running it from mwmaint1002?
[11:48:12] <Urbanecm>	 yes, running from that host
[11:48:29] <Urbanecm>	 why do you ask?
[11:48:33] <Daimona>	 Great, then I'm filtering for it on logstash
[11:48:41] <Urbanecm>	 aha!
[11:48:56] <Daimona>	 To ensure nothing goes wrong, although we shouldn't have problems
[11:48:59] <hauskatze>	 additionally we could have Daimona on the deployment group as well /mehides
[11:49:24] <Daimona>	 hauskatze: I almost never need to deploy stuff :-)
[11:49:55] <Urbanecm>	 Daimona, or restricted (that's mwlog1001, mwmaint1002 etc)
[11:50:05] <wikibugs>	 (PS1) Effie Mouzeli: WIP profile::mediawiki::jobrunner: eature flags [puppet] - https://gerrit.wikimedia.org/r/523908
[11:50:53] <wikibugs>	 (CR) Effie Mouzeli: [C: -2] "This should be merged after we have enabled the use of feature flags on jobrunners (523908)" [puppet] - https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) (owner: Effie Mouzeli)
[11:51:48] <Urbanecm>	 Daimona, currently on iewiki
[11:52:19] <Daimona>	 Alright, we'll wait :)
[11:52:57] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[11:53:27] <Daimona>	 Meh that error on labtestwiki again
[11:54:16] <Urbanecm>	 i wouldn't say that's an issue
[11:54:29] <Urbanecm>	 labtestwiki is even inaccessible for the public
[11:54:36] <Daimona>	 Yeah indeed
[11:54:40] <Daimona>	 Just some logspam
[11:54:42] <Urbanecm>	 yup
[11:56:21] <wikibugs>	 (PS2) Gehel: wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122)
[11:57:06] <wikibugs>	 (CR) Gehel: [C: +2] wdqs: introduced tuned journal options to wdqs1004. [puppet] - https://gerrit.wikimedia.org/r/523867 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[11:58:09] <Urbanecm>	 Daimona, https://phabricator.wikimedia.org/P8759
[11:58:20] <Daimona>	 Thanks, gonna filter and diff
[11:58:29] <Urbanecm>	 great
[11:58:32] <Urbanecm>	 ping me once you're ready
[11:58:50] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[11:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1200)
[12:00:11] <Daimona>	 Alright, same as last time, minus cawikinews, which was fixed on-wiki
[12:00:26] <Daimona>	 So, ready to move on
[12:00:36] <Daimona>	 I'd like to see it on cawiki only first
[12:00:42] <Daimona>	 Just plz gimme a moment to open it
[12:01:28] <Daimona>	 OK, ready for cawiki
[12:02:07] <Daimona>	 (ping Urbanecm)
[12:02:39] <Urbanecm>	 Daimona, ok, running on cawiki
[12:03:44] <Urbanecm>	 Daimona, it said "Throttle parameters successfully normalized. Changed 2 rows."
[12:04:00] <Daimona>	 Yep https://ca.wikipedia.org/wiki/Especial:Filtre_d%27abuses/history/9/diff/prev/164
[12:04:15] <Daimona>	 Lemme check the afh table from quarry just to be sure
[12:04:25] <Urbanecm>	 no idea if it's good, but looks so :)
[12:04:27] <Daimona>	 Uh actually, it's not on quarry
[12:04:38] <Urbanecm>	 Daimona, you can write your query here, I can run it for you
[12:04:56] <Daimona>	 Uhm let's see
[12:05:15] <Daimona>	 SELECT * FROM abuse_filter_history WHERE afh_id = 164
[12:05:21] <Daimona>	 Can be posted publicly because the filter is public
[12:06:00] <Urbanecm>	 Daimona, running
[12:06:05] <Daimona>	 Ty
[12:06:31] <Urbanecm>	 Daimona, https://phabricator.wikimedia.org/P8761
[12:06:48] <Daimona>	 Of note, the script removed "user," instead of just the comma, but I guess I wrote it like that just to keep previous behaviour. I'll have to write an on-wiki notice
[12:07:08] <Urbanecm>	 Ok
[12:07:52] <Daimona>	 Yeah, it's fine
[12:08:00] <Urbanecm>	 Wonderful :)
[12:08:06] <Daimona>	 Now viwiki alone
[12:08:08] <Urbanecm>	 doing
[12:08:52] <Urbanecm>	 Daimona, https://phabricator.wikimedia.org/P8762
[12:09:43] <Daimona>	 Uhm
[12:10:15] <Urbanecm>	 what's happening?
[12:10:28] <Daimona>	 Seems like no changes were made, but maybe I just opened the page too late
[12:10:36] <Daimona>	 Lemme check the source
[12:10:55] <Urbanecm>	 ok
[12:11:26] <Urbanecm>	 !log Ran extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php for cawiki and viwiki (T209565)
[12:11:28] <Daimona>	 Could you please run: SELECT * FROM abuse_filter_history WHERE afh_id = 48
[12:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:33] <stashbot>	 T209565: Dry run for normalizeThrottleParameters.php - https://phabricator.wikimedia.org/T209565
[12:11:34] <Urbanecm>	 certainly
[12:11:37] <Daimona>	 Ty
[12:11:51] <Daimona>	 I think it's fine, we don't beautify groups for old rows I guess
[12:12:35] <Daimona>	 Only if they're empty, plus it added explicit 0s in the other params, so I believe it's working as intended
[12:13:09] <Urbanecm>	 Daimona, https://phabricator.wikimedia.org/P8763 (WMF-NDA only paste, that filter looks non-public)
[12:13:23] <Daimona>	 Yeah, thanks it's indeed private, forgot to say that
[12:14:09] <Daimona>	 OK as I suspected, only 0s were added, which is fine
[12:14:22] <Urbanecm>	 good
[12:14:25] <Daimona>	 So... Let's unleash that little boy on all wikis!
[12:14:47] <Urbanecm>	 doing!
[12:15:12] <wikibugs>	 (PS1) Jbond: standard::base - reorder: Ensure admin runs early [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342)
[12:15:39] <Urbanecm>	 !log Running foreachwiki extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php in tmux session on mwmaint1002 (T209565)
[12:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:54] <Urbanecm>	 Daimona, would you need the queries for changed history rows?
[12:21:26] <Daimona>	 I don't think we do, the ones we got looked promising
[12:21:48] <Urbanecm>	 ok
[12:21:59] <Daimona>	 Did it complete?
[12:22:02] <Urbanecm>	 not yet
[12:22:07] <Urbanecm>	 ruwiktionary
[12:22:22] <Daimona>	 Great
[12:23:18] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T228245 (fgiunchedi) Open→Resolved a:fgiunchedi Looks like another case of hw raid controller lockup, I've rebooted and upgraded the controller firmware. Host came back normal!
[12:24:51] <Urbanecm>	 Daimona, we're done!
[12:24:58] <Daimona>	 Cool
[12:25:00] <Urbanecm>	 https://phabricator.wikimedia.org/P8764
[12:25:25] <Daimona>	 Thanks, now checking
[12:25:29] <Urbanecm>	 ok
[12:25:35] <Urbanecm>	 let me know if you need anything
[12:27:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, two nits inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) (owner: Jbond)
[12:33:41] <wikibugs>	 (CR) Ema: [C: +2] prometheus: fetch ATS origin server metrics [puppet] - https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668) (owner: Ema)
[12:33:49] <wikibugs>	 (PS3) Ema: prometheus: fetch ATS origin server metrics [puppet] - https://gerrit.wikimedia.org/r/523898 (https://phabricator.wikimedia.org/T227668)
[12:34:04] <Urbanecm>	 Daimona, fyi, linked the outputs on the task ftr
[12:34:33] <Daimona>	 Yeah thanks
[12:34:44] <Daimona>	 I just finished sample-checking some wikis, and everything looks great!
[12:35:04] <Daimona>	 So well, I'll just go ahead and resolve a few tasks
[12:35:14] <Daimona>	 Thanks a lot for your help!
[12:36:27] <godog>	 !log upgrade hp raid firmware on ms-be1 hosts - T141756
[12:36:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:34] <stashbot>	 T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756
[12:36:52] <Urbanecm>	 happy to help Daimona!
[12:45:19] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) **Status**  There is now a Javamelody prometheus exporter at https://gerrit.wikimedia.or...
[12:45:26] <wikibugs>	 (PS2) Jbond: standard::base - reorder: Ensure admin runs early [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342)
[12:45:29] <wikibugs>	 (PS1) Jbond: standard: remove has_admin global variable [puppet] - https://gerrit.wikimedia.org/r/523914
[12:46:24] <wikibugs>	 (CR) Gehel: [C: +2] Skipping download if PBF file exists [puppet] - https://gerrit.wikimedia.org/r/523718 (owner: MSantos)
[12:47:37] <wikibugs>	 (PS2) Gehel: Skipping download if PBF file exists [puppet] - https://gerrit.wikimedia.org/r/523718 (owner: MSantos)
[12:47:41] <wikibugs>	 (CR) Jbond: [C: +2] standard::base - reorder: Ensure admin runs early (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342) (owner: Jbond)
[12:47:55] <godog>	 anomie: please let me know when you are around, I'd like to merge https://gerrit.wikimedia.org/r/#/c/493323/ and then ask you to validate that things look good, let me know when it is a good time to do that
[12:48:08] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/523876 (owner: Muehlenhoff)
[12:49:19] <wikibugs>	 (PS2) Gehel: maps: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/523876 (owner: Muehlenhoff)
[12:49:24] <wikibugs>	 (PS3) Jbond: standard::base - reorder: Ensure admin runs early [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342)
[12:49:55] <wikibugs>	 (PS2) Alexandros Kosiaris: Switch ORES pool counters for codfw to 2003/2004 [puppet] - https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff)
[12:50:08] <anomie>	 godog: I'm around now, but about to have some meetings. A good time for me would probably start in 2 hours and 15 minutes or so.
[12:50:13] <wikibugs>	 (CR) Gehel: [C: +2] maps: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/523876 (owner: Muehlenhoff)
[12:51:57] <wikibugs>	 (PS3) Alexandros Kosiaris: Switch ORES pool counters for codfw to 2003/2004 [puppet] - https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff)
[12:52:07] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/521835 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff)
[12:52:57] <godog>	 anomie: sounds good, I have a meeting in three hours, ping me after your meetings and we can do it
[12:54:05] <hashar>	 so for T228250 """PHP Notice: Undefined property: stdClass::$module in OATHAuth/src/OATHUserRepository.php on line 193"""
[12:54:05] <stashbot>	 T228250: PHP Notice:  Undefined property: stdClass::$module in OATHAuth/src/OATHUserRepository.php on line 193 - https://phabricator.wikimedia.org/T228250
[12:54:12] <hashar>	 that seems to be solely for translatewiki.net
[12:54:30] <hashar>	 according to the task, the cause is a database change in OATHAuth extension https://phabricator.wikimedia.org/rEOATea984e5c2b2edd24f00c90766d640a65aafb75fa
[12:54:31] <wikibugs>	 (PS4) Jbond: standard::base - reorder: Ensure admin runs early [puppet] - https://gerrit.wikimedia.org/r/523913 (https://phabricator.wikimedia.org/T201342)
[12:54:36] <hashar>	 which got merged / included in 1.34.0-wmf.11
[12:54:47] <hashar>	 so if we had the issue on wmf production we would surely have the same error
[12:54:57] <hashar>	 or would have noticed (since the task claims that users are unable to login)
[12:55:33] <hashar>	 https://phabricator.wikimedia.org/T225643 hints at a database schema change that occurred on oathauth_users table to add columns 'module' and 'data'
[12:55:40] <hashar>	 so I guess WMF prod is covered and working fine
[12:55:52] <hashar>	 == it is not a blocker to the train ;-]
[12:55:54] <hashar>	 liw: ^^
[12:55:57] <hashar>	 public summary!
[12:56:03] <liw>	 ack, thanks hashar 
[12:56:15] <liw>	 the train deployment window is opening in a couple of minutes
[12:57:12] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[12:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] <jouncebot>	 liw: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1300).
[13:03:05] <hashar>	 dcausse: hi, in case you are around we found some warning with CirrusSearch :-\
[13:03:06] <hashar>	 PHP Warning: Attempted to serialize unserializable builtin class Closure$CirrusSearch\Profile\CompletionSearchProfileRepository::__construct;3047
[13:03:41] <hashar>	 task being filed
[13:04:29] <dcausse>	 hashar: thanks, looking
[13:04:36] <hashar>	 dcausse: repro https://www.mediawiki.org//w/api.php?action=query&format=json&formatversion=2&prop=extracts%7Cpageimages%7Cdescription%7Cpageprops&generator=search&gsrlimit=3&gsrprop=redirecttitle&gsrsearch=morelike%3AWikimedia%20Apps%2FiOS%20FAQ%2Fja&gsrwhat=text&exchars=256&exintro=&exlimit=3&explaintext=&pilicense=any&pilimit=3&piprop=thumbnail&pithumbsize=120 
[13:05:10] <hashar>	 dcausse: and there is a second code path causing the issue
[13:06:47] <ema>	 !log prometheus servers: remove varnish-upload_$dc_backend.yaml, replaced by ATS equivalent T227668
[13:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:55] <stashbot>	 T227668: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668
[13:07:47] <wikibugs>	 Operations, netops, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi)
[13:09:28] <liw>	 dcausse, hashar: https://phabricator.wikimedia.org/T228276 is the ticket I just filed for this
[13:09:54] <dcausse>	 liw: thanks I'm on it
[13:10:46] <liw>	 dcausse, thanks!
[13:10:53] <ema>	 !log cp-codfw: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672
[13:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:01] <stashbot>	 T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672
[13:11:30] <hashar>	 seems that breaks API queries when the generator is the search system
[13:11:34] <hashar>	 or something like that
[13:13:01] <hashar>	 one of the errors had as referrer https://cho.m.wikipedia.org/wiki/Hattak
[13:18:36] <hashar>	 dcausse: fun, php7.2 does throw an exception "Serialization of 'Closure' is not allowed" 
[13:18:52] <hashar>	 slightly different message :]
[13:20:53] * liw is entirely out of his depth trying to understand this stuff, so treats anything as a blocker
[13:21:15] <wikibugs>	 (PS5) Muehlenhoff: Add LDAP replicas in codfw to conf-tool/LVS [puppet] - https://gerrit.wikimedia.org/r/523624
[13:26:20] <moritzm>	 !log disabled puppet on Icinga hosts in preparation of adding the LDAP replicas/codfw to LVS
[13:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add LDAP replicas in codfw to conf-tool/LVS [puppet] - https://gerrit.wikimedia.org/r/523624 (owner: Muehlenhoff)
[13:28:08] <wikibugs>	 (PS2) BBlack: Redirect already configured wikipedia non canonical domains to ncredir [dns] - https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[13:28:10] <wikibugs>	 (PS1) BBlack: Add domain root addrs for ncredir [dns] - https://gerrit.wikimedia.org/r/523924 (https://phabricator.wikimedia.org/T133548)
[13:28:42] <wikibugs>	 (CR) BBlack: [C: +1] Redirect already configured wikipedia non canonical domains to ncredir [dns] - https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[13:33:34] <hashar>	 dcausse: I guess I can just +2 your change :)
[13:33:52] <dcausse>	 hashar: please :)
[13:35:15] <wikibugs>	 (PS2) Gehel: wdqs: introduced tuned journal options to wdqs2005. [puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122)
[13:35:18] <liw>	 progress!
[13:35:48] <wikibugs>	 (CR) Gehel: [C: +2] wdqs: introduced tuned journal options to wdqs2005. [puppet] - https://gerrit.wikimedia.org/r/523868 (https://phabricator.wikimedia.org/T228122) (owner: Gehel)
[13:37:59] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:38:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:01] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.252:389, 208.80.153.252:636]) https://wikitech.wikimedia.org/wiki/PyBal
[13:46:51] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 10 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[13:48:05] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.252:389, 208.80.153.252:636]) https://wikitech.wikimedia.org/wiki/PyBal
[13:49:47] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2005 is CRITICAL: CRITICAL: 10 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[13:50:26] <elukey>	 seems to be ldap-ro.codfw.wikimedia.org.
[13:50:33] <elukey>	 moritzm: --^
[13:50:38] <elukey>	 pretty sure it is ok
[13:50:43] <elukey>	 just wanted to triple check
[13:53:58] <moritzm>	 yeah, I'd expect that's the effect of the new endpoints being available, but pybal not yet restarted
[13:55:43] <wikibugs>	 (PS1) Ema: restbase: add TLS support via tlsproxy::localssl [puppet] - https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411)
[13:55:50] <ema>	 moritzm: indeed, all good
[13:57:25] <wikibugs>	 Operations, ops-codfw, DBA: (OoW) db2045 failed battery - https://phabricator.wikimedia.org/T227862 (Marostegui) Open→Declined Going to close this ticket as I have created the decommission one: {T228281}
[13:57:29] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[13:57:40] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[13:57:49] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[13:58:44] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[14:00:33] <wikibugs>	 (PS1) Ema: secret: dummy key for restbase [labs/private] - https://gerrit.wikimedia.org/r/523929 (https://phabricator.wikimedia.org/T210411)
[14:02:13] <logmsgbot>	 !log liw@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CirrusSearch/includes/Searcher.php: Do not serialize ResultsType instance T228276 (duration: 00m 55s)
[14:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:22] <stashbot>	 T228276: PHP Warning: Attempted to serialize unserializable builtin class Closure$CirrusSearch\Profile\CompletionSearchProfileRepository::__construct;2912 - https://phabricator.wikimedia.org/T228276
[14:03:50] <hashar>	 dcausse: solved! ( it works: https://www.mediawiki.org/w/api.php?action=query&format=json&formatversion=2&prop=extracts%7Cpageimages%7Cdescription%7Cpageprops&generator=search&gsrlimit=3&gsrprop=redirecttitle&gsrsearch=morelike%3AWikimedia%20Apps%2FiOS%20FAQ&gsrwhat=text&exchars=256&exintro=&exlimit=3&explaintext=&pilicense=any&pilimit=3&piprop=thumbnail&pithumbsize=120  )
[14:03:53] <wikibugs>	 (CR) Ema: [V: +2 C: +2] secret: dummy key for restbase [labs/private] - https://gerrit.wikimedia.org/r/523929 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[14:04:18] <wikibugs>	 (PS1) Fsero: swift: enable logging for container-sync-to-sync [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196)
[14:04:41] <dcausse>	 hashar: thanks!
[14:05:04] <dcausse>	 liw: sorry about that!
[14:05:48] <hashar>	 jouncebot: next
[14:05:48] <jouncebot>	 In 1 hour(s) and 54 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600)
[14:05:50] <hashar>	 :]
[14:05:53] <hashar>	 jouncebot: now
[14:05:53] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1300)
[14:06:15] <wikibugs>	 (PS1) Lars Wirzenius: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/523931
[14:06:18] <wikibugs>	 (CR) Lars Wirzenius: [C: +2] group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/523931 (owner: Lars Wirzenius)
[14:07:15] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/523931 (owner: Lars Wirzenius)
[14:07:29] <wikibugs>	 (CR) jenkins-bot: group1 wikis to 1.34.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/523931 (owner: Lars Wirzenius)
[14:09:07] <moritzm>	 !log restarting pybal on backup LVSes in codfw
[14:09:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:45] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.14
[14:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:40] <logmsgbot>	 !log liw@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.14 (duration: 00m 54s)
[14:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:21] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2005 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:12:55] <wikibugs>	 (CR) Filippo Giunchedi: swift: enable logging for container-sync-to-sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:13:44] <wikibugs>	 (CR) Filippo Giunchedi: swift: enable logging for container-sync-to-sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:14:19] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:14:29] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:23:07] <liw>	 dcausse, no worries, thanks for the quick fix
[14:24:35] <wikibugs>	 (PS2) Fsero: swift: enable logging for container-sync-to-sync [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196)
[14:25:22] <wikibugs>	 (CR) Fsero: swift: enable logging for container-sync-to-sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:26:13] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:27:16] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (fsero) as long @RStallman-legalteam comes back with a positive result, the clinic duty person will move this forward (thi...
[14:27:42] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (fsero) p:Triage→Normal
[14:29:08] <liw>	 and now there's a bunch of other new error messages in logstash
[14:29:14] <wikibugs>	 (PS2) Fsero: Add accraze to deployment and deploy-service groups [puppet] - https://gerrit.wikimedia.org/r/523778 (https://phabricator.wikimedia.org/T228191) (owner: Halfak)
[14:30:20] <wikibugs>	 (CR) Fsero: [C: +2] Add accraze to deployment and deploy-service groups [puppet] - https://gerrit.wikimedia.org/r/523778 (https://phabricator.wikimedia.org/T228191) (owner: Halfak)
[14:30:54] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (RStallman-legalteam) The NDA is signed. Fine to move forward. Thanks!
[14:30:59] <gehel>	 !log repool maps1004 - T218097
[14:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:07] <stashbot>	 T218097: [Bug] Some OSM relations didn't become polygons and are not been served through geoshapes service - https://phabricator.wikimedia.org/T218097
[14:31:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! see nitpick inline for rsyslog and commit message" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:31:30] <wikibugs>	 Operations, SRE-Access-Requests, Scoring-platform-team, Patch-For-Review: Add accraze to deployment and deploy-service groups. - https://phabricator.wikimedia.org/T228191 (fsero) done.  @Halfak thanks for the patch
[14:31:50] <wikibugs>	 Operations, SRE-Access-Requests, Scoring-platform-team, Patch-For-Review: Add accraze to deployment and deploy-service groups. - https://phabricator.wikimedia.org/T228191 (fsero) Open→Resolved p:Triage→Normal
[14:32:04] <liw>	 dcausse, would "PHP Fatal Error from line 21 of /srv/mediawiki/php-1.34.0-wmf.14/extensions/CirrusSearch/includes/ElasticaErrorHandler.php: Object of class Elastica\Response could not be converted to string" also fall in your wheelhouse?
[14:32:34] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.postgresql.postgres-init
[14:32:39] <dcausse>	 liw: yes I think so, looking
[14:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:30] <liw>	 filing task
[14:34:33] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:34:48] <wikibugs>	 (PS3) Fsero: swift: enable logging for container synchronization-to-synchronization [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196)
[14:34:58] <liw>	 dcausse, https://phabricator.wikimedia.org/T228283
[14:35:06] <moritzm>	 !log restart pybal on lvs2002 (codfw primary) T227778
[14:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:13] <stashbot>	 T227778: Create an LDAP replica in codfw (using LVS) - https://phabricator.wikimedia.org/T227778
[14:35:22] <wikibugs>	 (CR) Fsero: swift: enable logging for container synchronization-to-synchronization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:35:57] <wikibugs>	 (CR) Fsero: [C: +2] swift: enable logging for container synchronization-to-synchronization [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196) (owner: Fsero)
[14:36:07] <wikibugs>	 (PS4) Fsero: swift: enable logging for container synchronization-to-synchronization [puppet] - https://gerrit.wikimedia.org/r/523930 (https://phabricator.wikimedia.org/T228196)
[14:37:39] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2002 is OK: OK: 12 connections established with conf2001.codfw.wmnet:2379 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:38:39] <wikibugs>	 (PS1) Ottomata: Set cloudvirtan* to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128)
[14:39:07] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:39:39] <wikibugs>	 (CR) Ottomata: [C: +2] Set cloudvirtan* to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) (owner: Ottomata)
[14:39:47] <wikibugs>	 (PS2) Ottomata: Set cloudvirtan* to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128)
[14:39:56] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Set cloudvirtan* to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/523935 (https://phabricator.wikimedia.org/T225128) (owner: Ottomata)
[14:40:23] <wikibugs>	 (CR) Elukey: "John I have a question for you if you have time. This morning while reviewing this change I recalled that undef values in erb do not alway" [puppet] - https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: Effie Mouzeli)
[14:41:00] <logmsgbot>	 !log otto@cumin1001 START - Cookbook sre.hosts.decommission
[14:41:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:23] <logmsgbot>	 !log otto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:31] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698)
[14:41:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by otto@cumin1001 for hosts: `cloudvirtan[10...
[14:43:51] <wikibugs>	 (03PS6) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066)
[14:45:11] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698)
[14:45:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) (owner: 10Alexandros Kosiaris)
[14:45:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Add user alaasarhan [puppet] - 10https://gerrit.wikimedia.org/r/523937 (https://phabricator.wikimedia.org/T223698) (owner: 10Alexandros Kosiaris)
[14:45:41] <fsero>	 !log enabling container-sync logging T228196
[14:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:48] <stashbot>	 T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196
[14:46:17] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:48:07] <liw>	 https://phabricator.wikimedia.org/T228286 - another blocker filed: LocalFile.php: Call to a member function getName() on a non-object (null)
[14:48:24] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243)
[14:50:23] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: 10Marostegui)
[14:50:44] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1001/17433/" [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) (owner: 10Marostegui)
[14:51:05] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10akosiaris) 05Open→03Resolved a:03akosiaris User has been ad...
[14:53:24] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10Papaul) a:05Papaul→03Marostegui Replaced  with a used one.
[14:55:14] <moritzm>	 !log updated jenkins in thirdparty/ci (stretch) and thirdparty (jessie) to 2.176.2 (T228142)
[14:55:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:20] <wikibugs>	 (03PS1) 10Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723)
[14:56:28] <wikibugs>	 (03PS2) 10Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723)
[14:56:32] <wikibugs>	 (03PS11) 10CDanis: WIP dbctl [puppet] - 10https://gerrit.wikimedia.org/r/523013
[14:56:33] <liw>	 the mediawiki-new-errors dashboard on logstash is showing about 18 new errors now, mostly database, local storage, or swift - anyone around who can take a look?
[14:56:34] <wikibugs>	 (03PS1) 10CDanis: WIP WIP broken dbctl: schemata [puppet] - 10https://gerrit.wikimedia.org/r/523943
[14:59:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10Marostegui) Thanks - I can see it rebuilding: `       physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding) `
[15:00:08] <godog>	 !log poweroff ms-be2022 - T227667
[15:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:14] <stashbot>	 T227667: ms-be2022 misbehaving / error on boot - https://phabricator.wikimedia.org/T227667
[15:01:09] <wikibugs>	 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) 05Open→03Resolved @Gehel  I checked this server again today, all looks good. Resolving this task for now. We can reopen it anytime.   thanks.
[15:03:17] <jijiki>	 !log Depool mw2269 to reboot it - T227548
[15:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:25] <stashbot>	 T227548: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548
[15:03:38] <anomie>	 godog: I'm ready for https://gerrit.wikimedia.org/r/#/c/493323/ if you are
[15:03:54] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Reporting some info from https://github.com/ROCmSoftwarePlatfo...
[15:04:32] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Alright, nodes are role spare::system and decommed/downtimed in icinga.
[15:04:40] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata)
[15:05:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[15:05:59] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:07] <papaul>	 !log shutting down ms-be2022 for HW  troubleshooting
[15:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:38] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) @cmjohnson back atcha :)
[15:07:07] <hashar>	 jouncebot: now
[15:07:07] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[15:07:09] <hashar>	 jouncebot: next
[15:07:09] <jouncebot>	 In 0 hour(s) and 52 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600)
[15:07:21] <hashar>	 !log upgrading CI Jenkins # T228142
[15:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:12] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:08:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nicely done. I would have given up on providing default values, thanks for persevering" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli)
[15:10:00] <icinga-wm>	 PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[15:10:17] <wikibugs>	 (03PS1) 10Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945
[15:10:53] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[15:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] proemtheus: testing centralize stats config (do not deploy) [puppet] - 10https://gerrit.wikimedia.org/r/523945 (owner: 10Cwhite)
[15:11:35] <hashar>	 I am waiting for some jobs to complete
[15:11:44] <icinga-wm>	 PROBLEM - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:11:55] <papaul>	 !log shutting down mw2250 for disk replacement 
[15:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:06] <icinga-wm>	 PROBLEM - High lag on wdqs2005 is CRITICAL: 5631 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:12:34] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:13:04] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:13:04] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 5631 ge 3600 Gehel catching up on updates after data reset https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:13:30] <icinga-wm>	 PROBLEM - Host mw2250 is DOWN: PING CRITICAL - Packet loss = 100%
[15:14:36] <icinga-wm>	 PROBLEM - Host ms-be2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:49] <fsero>	 !log restarting swift-container-sync on ms-be* for getting logging configuration T228196
[15:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:56] <stashbot>	 T228196: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196
[15:15:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "> This morning while reviewing this change I recalled that undef values in erb do not always correspond to false, but I might misremember." [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli)
[15:18:46] <icinga-wm>	 RECOVERY - Host mw2250 is UP: PING WARNING - Packet loss = 93%, RTA = 36.15 ms
[15:18:58] <icinga-wm>	 PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[15:19:02] <icinga-wm>	 PROBLEM - Nginx local proxy to videoscaler on mw2250 is CRITICAL: connect to address 10.192.0.76 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner
[15:19:30] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:19:31] <wikibugs>	 10Operations, 10Traffic, 10CommRel-Specialists-Support (Jul-Sep-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) @Pruem ^^^ :)
[15:20:02] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Papaul) a:05Papaul→03MoritzMuehlenhoff Replaced both 500GB disks with 250GB disks . All your's for re-imaging
[15:20:49] <wikibugs>	 (03PS1) 10CDanis: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947
[15:21:32] <icinga-wm>	 PROBLEM - Host mw2250 is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:08] <icinga-wm>	 RECOVERY - jenkins_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[15:23:58] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Investigate if puppetdbquery::query_resources should work on PCC - https://phabricator.wikimedia.org/T228266 (10jbond)
[15:25:00] <icinga-wm>	 PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[15:25:08] <icinga-wm>	 RECOVERY - Host ms-be2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms
[15:25:43] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 04-1] "(The rest still has to be sorted out)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel)
[15:25:58] <icinga-wm>	 PROBLEM - Host mw2269 is DOWN: PING CRITICAL - Packet loss = 100%
[15:26:47] <jijiki>	 ^ downtime expired
[15:27:38] * Urbanecm staging on mwdebug
[15:29:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Capture calico deployment in code. [deployment-charts] - 10https://gerrit.wikimedia.org/r/523580 (https://phabricator.wikimedia.org/T227775) (owner: 10Fsero)
[15:30:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis)
[15:31:12] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): amusso apt broken due to python upgrade which triggers a replacement of zuul embedded python https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[15:31:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of minor comments inline, plus a question of whether we want to ship own own coredns chart under releases.wikimedia.org/charts or n" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero)
[15:32:12] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[jenkins],Package[zuul] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[15:32:40] <wikibugs>	 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2022 misbehaving / error on boot - https://phabricator.wikimedia.org/T227667 (10Papaul) a:05Papaul→03fgiunchedi Power drain, reboot the sever 3 times no more errors. @fgiunchedi  please feel free to double check and resolve task.  Thanks.
[15:33:23] <wikibugs>	 (03CR) 10Effie Mouzeli: profile:service_proxy: Add more hiera variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli)
[15:35:58] <icinga-wm>	 RECOVERY - Host mw2269 is UP: PING OK - Packet loss = 0%, RTA = 38.11 ms
[15:36:04] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] profile:service_proxy: Add more hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli)
[15:36:12] <wikibugs>	 (03PS5) 10Effie Mouzeli: profile:service_proxy: Add more hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/523703 (https://phabricator.wikimedia.org/T228063)
[15:37:36] <Urbanecm>	 !log Deployed patch for T207094 T228284 to wmf.13 and wmf.14
[15:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:43] <stashbot>	 T228284: SpecialCheckUser: Call to a member function userCan() on a non-object (null) - https://phabricator.wikimedia.org/T228284
[15:39:44] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: PCC always has an ERROR when compiling for servers with  profile::redis::slave - https://phabricator.wikimedia.org/T228266 (10jbond)
[15:40:32] <icinga-wm>	 PROBLEM - Host mw2269 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:41] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10akosiaris) >>! In T224794#5339362, @wiki_willy wrote: > @akosiaris or @Volans - we can order drive replacements for this, since it's out of warranty, but I was trying to figure out how this correlates with the new...
[15:40:44] <icinga-wm>	 RECOVERY - Host mw2269 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms
[15:41:23] <wikibugs>	 (03CR) 10Fsero: "regarding the chart, i don't mind publishing it but this chart i do see it something pretty specific and internal of the use case." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero)
[15:41:24] <wikibugs>	 10Operations, 10ops-codfw: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548 (10Papaul) a:05Papaul→03jijiki Power drain, SSH  to mgmt is back working @jijiki  Please feel free to repool server  Thanks
[15:42:09] <wikibugs>	 10Operations, 10netops, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10ayounsi) Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possible to change the CNAMEs instead?
[15:42:19] <Urbanecm>	 I just got 15:41:23 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'php-1.34.0-wmf.14', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', '--include', 'redacted', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on 
[15:42:19] <Urbanecm>	 mw2269.codfw.wmnet returned [255]: ssh: connect to host mw2269.codfw.wmnet port 22: Connection timed out while emergency-deploying
[15:42:29] <wikibugs>	 (03PS2) 10CDanis: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947
[15:42:31] <wikibugs>	 (03PS1) 10CDanis: dbctl: part 2/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523950
[15:42:50] <icinga-wm>	 PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[15:43:14] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) we will be replacing lvs2006 with lvs2010
[15:43:34] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) p:05High→03Lowest
[15:43:35] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbctl: part 1/2 to bring schema in line with production (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis)
[15:43:36] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] Termbox Staging - Change to internal docker repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/523771 (owner: 10Tarrow)
[15:44:54] <icinga-wm>	 RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1145 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:45:50] <akosiaris>	 Urbanecm: host seems to have crashed ~5m ago
[15:46:06] <akosiaris>	 it's back up right now, we'll have to investigate a bit what happened
[15:46:10] <Urbanecm>	 akosiaris, thanks. Do I need to do anything (re-sync?) or will it be taken care by someone else?
[15:46:27] <akosiaris>	 I think you should resync just to be on the safe side 
[15:46:32] <Urbanecm>	 will do
[15:46:59] <Urbanecm>	 !log Re-syncing patch for T207094 T228284 and wmf.14
[15:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:06] <stashbot>	 T228284: SpecialCheckUser: Call to a member function userCan() on a non-object (null) - https://phabricator.wikimedia.org/T228284
[15:47:08] <wikibugs>	 (03Merged) 10jenkins-bot: dbctl: part 1/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523947 (owner: 10CDanis)
[15:47:13] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/523950 (owner: 10CDanis)
[15:47:21] <Urbanecm>	 thank you again akosiaris 
[15:47:35] <akosiaris>	 Urbanecm: thanks as well
[15:47:44] <Urbanecm>	 sync completed with no errors
[15:48:16] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: PCC always has an ERROR when compiling for servers with  profile::redis::slave - https://phabricator.wikimedia.org/T228266 (10jbond) Investigating further this is due to how `populate_puppetdb` adds entries to the datab...
[15:48:54] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) 05Open→03Resolved a:03Vgutierrez >>! In T203194#5308402, @MoritzMuehlenhoff wrote: > @Vgutierrez The firmware update on the NICs fixed this for good, right? Can we clos...
[15:49:56] <wikibugs>	 (03Merged) 10jenkins-bot: dbctl: part 2/2 to bring schema in line with production [software/conftool] - 10https://gerrit.wikimedia.org/r/523950 (owner: 10CDanis)
[15:50:31] <wikibugs>	 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Papaul) I checked the system log, no memory errors or temperature warnings but found out that the server firmware is very old. We can depool the server if possible and I can upgrade the f...
[15:51:49] <wikibugs>	 (03PS1) 10Aklapper: Phab: Allow viewing ogg video files inline (instead of downloading) [puppet] - 10https://gerrit.wikimedia.org/r/523952 (https://phabricator.wikimedia.org/T228225)
[15:54:39] <icinga-wm>	 RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 790.3 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:57:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: Set connect_timeout for cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063)
[15:57:26] <wikibugs>	 (03PS1) 10Ema: restbase: add certificate for restbase.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/523956 (https://phabricator.wikimedia.org/T210411)
[15:58:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey)
[16:00:01] <wikibugs>	 10Operations: mw2269 rebooted/crashed unexpectedly on Jul 17th ~15:30UTC - https://phabricator.wikimedia.org/T228296 (10akosiaris) p:05Triage→03Normal
[16:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706)
[16:00:30] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] hieradata: Set connect_timeout for cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/523955 (https://phabricator.wikimedia.org/T228063) (owner: 10Effie Mouzeli)
[16:00:38] * Urbanecm has some config patches
[16:00:45] <wikibugs>	 (03Abandoned) 10EBernhardson: Increase services proxy connect timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/523194 (https://phabricator.wikimedia.org/T228063) (owner: 10EBernhardson)
[16:00:51] <wikibugs>	 (03PS2) 10Filippo Giunchedi: wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706)
[16:00:54] <wikibugs>	 (03PS2) 10Urbanecm: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150)
[16:00:59] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm)
[16:01:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi)
[16:01:22] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (10fgiunchedi) >>! In T228275#5341475, @ayounsi wrote: > Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possibl...
[16:01:47] <jbond42>	 !log copy confd package from stretch-wikimedia to buster-wikimedia
[16:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:13] <wikibugs>	 (03Merged) 10jenkins-bot: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm)
[16:03:28] <wikibugs>	 (03CR) 10jenkins-bot: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) (owner: 10Urbanecm)
[16:03:38] <wikibugs>	 (03PS1) 10CDanis: bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958
[16:04:12] <wikibugs>	 (03PS2) 10Urbanecm: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141)
[16:04:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm)
[16:05:54] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:523686|Enable partial blocks on dewiki]] (T228150) (duration: 00m 54s)
[16:05:57] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[16:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:01] <stashbot>	 T228150: Enable partial blocks on the German Wikipedia - https://phabricator.wikimedia.org/T228150
[16:06:08] <wikibugs>	 (03Merged) 10jenkins-bot: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm)
[16:07:27] <cmjohnson1>	 !log powering off cloudvirt1014 for rack move T226188
[16:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:34] <stashbot>	 T226188: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188
[16:07:59] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Raise zh_classicalwiki requirement for autoconfirmed (T228141) (duration: 00m 55s)
[16:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:06] <stashbot>	 T228141: Change Autoconfirmed users' age and number of edits at zh-classical wiki - https://phabricator.wikimedia.org/T228141
[16:08:09] <Urbanecm>	 !log Morning SWAT done
[16:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:02] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) Indeed  the server is not showing the Smart Storage Battery status. Lets try to upgrade the server firmware since the last upgrade was from 2015.   @fgiunchedi  Let me know when we can de...
[16:11:10] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958 (owner: 10CDanis)
[16:11:16] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) {F29791228}
[16:11:52] <icinga-wm>	 PROBLEM - Host cloudvirt1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:24] <shdubsh>	 ^ paged
[16:12:25] <apergos>	 paged
[16:12:25] <volans>	 cmjohnson1: &&&
[16:12:32] <arturo>	 :-/
[16:12:36] <volans>	 s/&/^/
[16:12:44] <apergos>	 ah rack move, I see
[16:12:49] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[16:13:51] <wikibugs>	 (03CR) 10jenkins-bot: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) (owner: 10Urbanecm)
[16:14:02] <wikibugs>	 (03Merged) 10jenkins-bot: bump version: --version and dbctl unification fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/523958 (owner: 10CDanis)
[16:14:47] <icinga-wm>	 PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:17:02] <XioNoX>	 ^ FPC 5 PEM 1 is not powered
[16:17:10] <bblack>	 ?
[16:17:36] <icinga-wm>	 RECOVERY - Host cloudvirt1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[16:18:16] <XioNoX>	 Host cloudvirt1014.mgmt is DOWN paging is a known issue: https://phabricator.wikimedia.org/T223458
[16:18:31] <elukey>	 PEM 1 is the power supply? 
[16:18:48] <wikibugs>	 10Operations, 10ops-eqdfw, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10wiki_willy) a:03Cmjohnson
[16:19:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10wiki_willy)
[16:19:37] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[16:19:58] <jijiki>	 !log Depool mw2181 - T205240
[16:20:00] <bblack>	 In English terms: I think that means we lost one of the redundant power inputs to one top-of-rack switch
[16:20:03] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[16:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:05] <stashbot>	 T205240: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240
[16:20:42] <XioNoX>	 yeah correct, lost redundant power
[16:20:47] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO (201907): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10greg)
[16:20:50] <XioNoX>	 FPC5 means row 5
[16:20:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[16:20:53] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:10] <elukey>	 thanks :)
[16:21:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Don't page on mgmt failures [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458)
[16:21:32] <bblack>	 cloudvirt1014 is in that same rack
[16:21:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10wiki_willy) @Cmjohnson - not sure if there's a loose connection somewhere on backup1001, but can you check it out when you have a few cycles?  This one needs to be up and runni...
[16:21:38] <elukey>	 Just wanted to know if the switch was down or only with one power input
[16:21:51] <bblack>	 the switch is probably up or there'd be more alerts, I think
[16:22:01] <elukey>	 yes definitely
[16:22:04] <bblack>	 https://netbox.wikimedia.org/dcim/racks/13/
[16:22:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Don't page on mgmt failures [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris)
[16:22:33] <bblack>	 (for a list of hosts in the same rack as the switch with the PEM fail)
[16:23:20] <bblack>	 oh an even better URI for that: https://netbox.wikimedia.org/dcim/devices/?rack_id=13
[16:24:48] <papaul>	 !log shutting down mw2181 for firmware upgrade
[16:24:54] <elukey>	 bblack: the info about B5 was in the switch's logs or somewhere else? (trying to understand how to read those alarms)
[16:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:14] <XioNoX>	 There is "powering off cloudvirt1014 for rack move T226188" from cmjohnson1, Chris could you check if the power cables for asw2-b5-eqiad are properly seated
[16:25:15] <stashbot>	 T226188: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188
[16:25:51] <bblack>	 ah!
[16:26:05] <XioNoX>	 (in meeting, will follow up after)
[16:26:42] <wikibugs>	 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Reedy) I would imagine we’re not going to be mounting a labs NFS host into a production host...
[16:26:56] <bblack>	 elukey: yeah we could probably stand to make some improvements in the alerting and UIs there... 
[16:27:21] <wikibugs>	 (03PS3) 10Elukey: aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723)
[16:27:51] <XioNoX>	 elukey: asw2-b-eqiad> show system alarms 
[16:27:51] <XioNoX>	 2019-07-17 16:11:01 UTC  Major  FPC 5 PEM 1 is not powered
[16:28:24] <cmjohnson1>	 XioNoX check now
[16:28:36] <elukey>	 XioNoX: ah so "FPC5 means row 5" is "rack 5" right?
[16:28:38] <XioNoX>	 that info isn't exposed over SNMP, so alerting would need to ssh to the device to run that command
[16:28:43] <bblack>	 for the uninitiated and/or without logging into network hardware, it is a bit of hoop jumping to follow that icinga switch alert down to a cause and a correlated physical rack location
[16:28:52] <XioNoX>	 rack 5, yeah :)
[16:29:01] <elukey>	 ah ok now it is clear, I was a bit confused :D
[16:29:07] <bblack>	 elukey: it does in this case, but I'm not sure it's a universal constant that FPC# == row#?
[16:29:11] <cmjohnson1>	 power cable on the pdu was loose
[16:29:36] <bblack>	 FPC# definitely correlates to the first number of interface naming when you look elsewhere though
[16:29:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: replace the amd-rocm component with amd-rocm26 [puppet] - 10https://gerrit.wikimedia.org/r/523942 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey)
[16:29:47] <icinga-wm>	 RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:29:48] <XioNoX>	 bblack: nah it's not, it's by convention. But at least now this is tracked in Netbox
[16:30:07] <bblack>	 asw2-b-eqiad FPC5 == interface ports named xe-5/x/y or ge-5/x/y on asw2-b-eqiad for sure
[16:30:45] <elukey>	 ahhh so PEM is Power Entry Module, so many acronyms to learn :D
[16:30:47] <XioNoX>	 yep!
[16:31:01] <bblack>	 for most hardware, it's pretty trivial (manually or with links) to go from a hostname to the enclosing rack and so-on
[16:31:25] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) I have quickly talked with @Paladox about it. He has tried the `metrics-reporter-prometh...
[16:31:27] <bblack>	 the switch stacks are kind of a special case, where it's not reliably trivial or easy
[16:31:39] <XioNoX>	 https://netbox.wikimedia.org/dcim/devices/1276/ see virtual chassis-> position
[16:31:43] <bblack>	 the failure is just for asw2-b-eqiad in icinga terms
[16:32:22] <bblack>	 figuring out it's FPC 5, and that FPC5 == Rack 5, is a bit challenging
[16:32:33] <elukey>	 yes that part I wanted/want to learn :)
[16:32:38] <dcausse>	 jouncebox: now
[16:32:43] <bblack>	 even with that link, nothing's explicitly saying FPC5 == Rack 5's TOR switch
[16:32:44] <elukey>	 it seems that I have a lot of info to work on now :)
[16:33:17] <dcausse>	 jouncebot: now
[16:33:17] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T1600)
[16:33:51] <bblack>	 (and nothing but digging deeper on switch CLI or staring at switch syslog entries even tells you that the initial icinga alert was specifically about FPC5/PEM1)
[16:35:15] <dcausse>	 I'm going to SWAT a MW patch if nobody objects
[16:35:53] <SMalyshev>	 dcausse: no objection
[16:36:52] <Lucas_WMDE>	 no objection, but perhaps !log reopen the SWAT since it was already closed? (though I’m not sure if that’s usually done, I just remember seeing it)
[16:36:59] <wikibugs>	 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) a:03Papaul
[16:37:50] <dcausse>	 !log reopening morning SWAT
[16:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:13] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[16:38:31] <wikibugs>	 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) @Marostegui  @jcrespo can you tell if it is 2TB SATA or SAS? If it is 2TB SATA we have some new ones onsite.
[16:39:25] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[16:40:50] <elukey>	 !log execute reprepro clearvanished on install1002 to clear buster-wikimedia|thirdparty/amd-rocm (not used anymore)
[16:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:32] <wikibugs>	 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10jcrespo) SAS HD disks of 1.819 TB.
[16:42:17] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Add domain root addrs for ncredir [dns] - 10https://gerrit.wikimedia.org/r/523924 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[16:44:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Redirect already configured wikipedia non canonical domains to ncredir [dns] - 10https://gerrit.wikimedia.org/r/523902 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez)
[16:44:53] <ori>	 Krenair: I actually wouldn't mind being added to deployment prep so I can verify that CL
[16:45:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson)
[16:45:58] <wikibugs>	 10Operations, 10ops-eqiad, 10procurement: Procurement Request for 3x 4tb SAS Drives for Helium-Array - https://phabricator.wikimedia.org/T228302 (10wiki_willy)
[16:45:59] <Urbanecm>	 dcausse, I'm currently deploying
[16:46:14] <Urbanecm>	 (sorry for not announcing)
[16:46:20] <dcausse>	 Urbanecm: ok
[16:46:27] <Urbanecm>	 (it's for T207094)
[16:46:40] <dcausse>	 Urbanecm: I have a patch just merged on CirrusSearch for wmf14
[16:46:45] <Urbanecm>	 ack
[16:47:44] <logmsgbot>	 !log gehel@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[16:47:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:20] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Services Operations): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo)
[16:48:52] <Urbanecm>	 !log Deployed patch for T207094
[16:48:56] <Urbanecm>	 dcausse, I'm done
[16:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:07] <dcausse>	 Urbanecm: thanks
[16:49:08] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks for back history @akosiaris , we'll get the replacement drives ordered for you via procurement #T228302.  ~Willy
[16:49:20] <wikibugs>	 10Operations, 10CX-cxserver, 10Citoid, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pchelolo)
[16:52:30] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Services Operations): Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Pchelolo)
[16:52:57] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team (Services Operations): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Pchelolo)
[16:53:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I pretty much agree on not getting paged on mgmt NIC issues. +1, but I didn't test the patch in any way." [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris)
[16:53:28] <wikibugs>	 (03PS1) 10Andrew Bogott: Re-install cloudvirt1014 with Stretch and the 10g nic [puppet] - 10https://gerrit.wikimedia.org/r/523969 (https://phabricator.wikimedia.org/T226188)
[16:54:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson)
[16:54:22] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CirrusSearch/includes/ElasticaErrorHandler.php: T228283: Log response data JSON on errors (duration: 00m 55s)
[16:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:29] <stashbot>	 T228283: ElasticaErrorHandler.php: Object of class Elastica\Response could not be converted to string - https://phabricator.wikimedia.org/T228283
[16:55:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Re-install cloudvirt1014 with Stretch and the 10g nic [puppet] - 10https://gerrit.wikimedia.org/r/523969 (https://phabricator.wikimedia.org/T226188) (owner: 10Andrew Bogott)
[16:55:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks! looks good to me. meant to remove paging for mgmt since a while" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris)
[16:56:44] <wikibugs>	 10Operations, 10ops-codfw: (OoW) Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Papaul) a:05Papaul→03jcrespo Disk replaced
[16:57:01] <dcausse>	 !log morning swat done
[16:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:12] <wikibugs>	 10Operations, 10DC-Ops, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrewbogott  This server is ready for you, i updated raid cfg to R10 and 2 spare di...
[16:57:50] <wikibugs>	 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 4 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo)
[16:57:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://phabricator.wikimedia.org/T223458" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris)
[16:58:40] <wikibugs>	 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team (Services Operations): Requests to MW 404 when on HTTPS - https://phabricator.wikimedia.org/T202982 (10Pchelolo)
[16:58:54] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Eevans) @Papaul you can take the server down as needed.
[16:59:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/17440/icinga1001.wikimedia.org/ but duplicate contact groups are not hurting it" [puppet] - 10https://gerrit.wikimedia.org/r/523963 (https://phabricator.wikimedia.org/T223458) (owner: 10Alexandros Kosiaris)
[17:00:14] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team (Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Pchelolo)
[17:00:22] <liw>	 it's a little after the deploy window, but it seems I need to roll back group1 because of https://phabricator.wikimedia.org/T228292
[17:01:57] <icinga-wm>	 RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational
[17:03:36] <liw>	 morning swat is over, nothing else on https://wikitech.wikimedia.org/wiki/Deployments for a bit, so going ahead with rollback
[17:06:30] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to 1.34.0-wmf.13"
[17:06:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:52] <wikibugs>	 (03CR) 10Dzahn: "hmm, i don't know this. adding herron" [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo)
[17:08:45] <wikibugs>	 (03PS2) 10Dzahn: trafficserver: add Icinga notes url for nrpe_monitor_script [puppet] - 10https://gerrit.wikimedia.org/r/521380
[17:09:08] <papaul>	 !log shutting down restbase2009 for firmware upgrade
[17:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:41] <icinga-wm>	 PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100%
[17:12:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:13:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:13:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:13:37] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:14:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:13] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:16:20] <mutante>	 oh. thanks for logging that papaul, that explains 
[17:16:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:18:14] <icinga-wm>	 PROBLEM - Host mw2181.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:18:45] <Pchelolo>	 that still doesn't quite explain it... in a perfect world that still shouldn't happen. I'll look into it
[17:21:02] <icinga-wm>	 PROBLEM - Host mw2181 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:32] <mutante>	 hmm. the mw host looks unexpected
[17:21:39] <mutante>	 is that right next to it ?
[17:21:48] <mutante>	 looking at that one
[17:22:28] <papaul>	 mutante: mw2181 was logged already
[17:22:31] <papaul>	 https://phabricator.wikimedia.org/T205240
[17:23:03] <papaul>	 mutante: doing firmware upgrade on mw2181
[17:23:09] <mutante>	 papaul: gotcha! thanks
[17:27:24] <icinga-wm>	 RECOVERY - Host mw2181 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[17:27:43] <wikibugs>	 (03PS1) 10CDanis: debian: release 1.1.1-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/523972
[17:29:00] <icinga-wm>	 RECOVERY - Host mw2181.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.95 ms
[17:32:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] trafficserver: add Icinga notes url for nrpe_monitor_script [puppet] - 10https://gerrit.wikimedia.org/r/521380 (owner: 10Dzahn)
[17:34:33] <wikibugs>	 (03PS1) 10Lars Wirzenius: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523973
[17:34:34] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523973 (owner: 10Lars Wirzenius)
[17:34:44] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] debian: release 1.1.1-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/523972 (owner: 10CDanis)
[17:35:33] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523973 (owner: 10Lars Wirzenius)
[17:36:10] <icinga-wm>	 RECOVERY - Host restbase2009 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[17:36:46] <wikibugs>	 (03CR) 10jenkins-bot: Revert "group1 wikis to 1.34.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523973 (owner: 10Lars Wirzenius)
[17:37:22] <wikibugs>	 (03Merged) 10jenkins-bot: debian: release 1.1.1-1 [software/conftool] - 10https://gerrit.wikimedia.org/r/523972 (owner: 10CDanis)
[17:46:41] <wikibugs>	 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo)
[17:55:01] <wikibugs>	 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) p:05High→03Normal
[17:55:23] <wikibugs>	 (03PS1) 10Kosta Harlan: Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310)
[17:56:51] <wikibugs>	 (03CR) 10Revi: [C: 03+1] Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: 10Kosta Harlan)
[17:57:00] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Beta: Add GrowthExperiments mentors list for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523977 (https://phabricator.wikimedia.org/T228310) (owner: 10Kosta Harlan)
[18:00:57] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] varnishmtail: use -logs /dev/stdin instead of -logfds 0 [puppet] - 10https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) (owner: 10Jbond)
[18:01:00] <cdanis>	 !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include stretch-wikimedia conftool/conftool_1.1.1-1_amd64.changes 
[18:01:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:14] <cdanis>	 !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include buster-wikimedia conftool/conftool_1.1.1-1+deb10u1_amd64.changes 
[18:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:25] <cdanis>	 !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include jessie-wikimedia conftool/conftool_1.1.1-1+deb8u1_amd64.changes 
[18:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:18] <cdanis>	 !log upgrade to python3-conftool 1.1.1-1 on mwdebug2001
[18:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:09] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s mw-canary
[18:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:04] <wikibugs>	 (03PS5) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[18:07:27] <wikibugs>	 (03CR) 10Volans: "LGTM, a couple of nits inline." (033 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond)
[18:07:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:08:34] <wikibugs>	 10Operations, 10ops-codfw: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Papaul) a:05Papaul→03MoritzMuehlenhoff This was a very long process upgrading the iDRAC: since the server had 1.5, I couldn't upgrade to 2.6 directly and had to upgrade first to 1.6, then to 2.6. Bef...
[18:11:20] <wikibugs>	 (03PS6) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[18:12:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:12:32] <wikibugs>	 (03CR) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[18:12:42] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) a:05MoritzMuehlenhoff→03None
[18:14:10] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) After the firmware upgrade, we still have the Smart Storage battery problem. Since the server is out of warranty, we cannot have the part replaced.
[18:14:50] <mutante>	 !log mw2181 - scap pull (T205240)
[18:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:57] <stashbot>	 T205240: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240
[18:15:42] <mutante>	 !log mw2181 - sudo: /usr/local/bin/mwscript: command not found  on scap pull ??
[18:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:47] <cdanis>	 !log testing conftool upgrade: cdanis@mw1261.eqiad.wmnet ~ % sudo -i depool
[18:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:19] <cdanis>	 !log cdanis@mw1261.eqiad.wmnet ~ % sudo -i pool  
[18:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:16] <icinga-wm>	 PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[18:23:27] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s eqsin
[18:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:03] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (10Dzahn) Running 'scap pull' on this host (to sync mw code before repooling) fails with "sudo: /usr/local/bin/mwscript: command not found".
[18:25:21] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s ulsfo
[18:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:39] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s esams
[18:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:08] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s codfw
[18:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:45] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u 2019-07-17-conftool.yaml -s eqiad
[18:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:50] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188)
[18:40:44] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188)
[18:41:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1014: update network adapter names for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/523986 (https://phabricator.wikimedia.org/T226188) (owner: 10Andrew Bogott)
[18:49:48] <icinga-wm>	 PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[18:50:36] <wikibugs>	 (03PS1) 10Cwhite: gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988
[18:50:46] <icinga-wm>	 RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[18:51:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gemfile: bump safe_yaml to 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/523988 (owner: 10Cwhite)
[18:52:58] <wikibugs>	 (03PS7) 10Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[18:53:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:53:41] <wikibugs>	 10Operations, 10DC-Ops, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Andrew) 05Open→03Resolved
[18:53:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[18:54:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[18:55:28] <icinga-wm>	 RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[18:55:35] <wikibugs>	 (PS2) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945
[18:56:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945 (owner: Cwhite)
[18:59:17] <wikibugs>	 (PS1) CDanis: dbctl schemata: move files to match prod [software/conftool] - https://gerrit.wikimedia.org/r/523989
[18:59:42] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Andrew)
[19:01:26] <wikibugs>	 (CR) Jbond: lookup checks: add checks to warn against using hiera and advice lookup (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: Jbond)
[19:01:56] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, Traffic, Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (JAufrecht)
[19:03:54] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/conftool] - https://gerrit.wikimedia.org/r/523989 (owner: CDanis)
[19:04:11] <wikibugs>	 (CR) CDanis: [C: +2] dbctl schemata: move files to match prod [software/conftool] - https://gerrit.wikimedia.org/r/523989 (owner: CDanis)
[19:04:22] <wikibugs>	 (CR) Jbond: "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523945 (owner: Cwhite)
[19:04:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2181.codfw.wmnet
[19:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:10] <wikibugs>	 (PS2) CDanis: conftool: update schemata for dbctl [puppet] - https://gerrit.wikimedia.org/r/523943
[19:06:12] <wikibugs>	 (PS12) CDanis: dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013
[19:06:43] <wikibugs>	 (Merged) jenkins-bot: dbctl schemata: move files to match prod [software/conftool] - https://gerrit.wikimedia.org/r/523989 (owner: CDanis)
[19:06:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] conftool: update schemata for dbctl [puppet] - https://gerrit.wikimedia.org/r/523943 (owner: CDanis)
[19:08:15] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (Dzahn) Made a separate task for the scap pull issue.  Repooled the server anyways.
[19:09:44] <wikibugs>	 (CR) 20after4: [C: +1] Phab: Allow viewing ogg video files inline (instead of downloading) [puppet] - https://gerrit.wikimedia.org/r/523952 (https://phabricator.wikimedia.org/T228225) (owner: Aklapper)
[19:10:58] <wikibugs>	 Operations, ops-codfw, serviceops: (OoW) MCE errors on mw2181 / temperature warnings - https://phabricator.wikimedia.org/T205240 (Dzahn) Open→Resolved a:Dzahn mcelog has not been written to since Oct 10 2018. No new thermal events after that.  So not sure if that tells us much about the f...
[19:11:10] <wikibugs>	 (PS3) CDanis: conftool: update schemata for dbctl [puppet] - https://gerrit.wikimedia.org/r/523943 (https://phabricator.wikimedia.org/T197126)
[19:11:12] <wikibugs>	 (PS13) CDanis: dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126)
[19:12:29] <hoo>	 greg-g: You there? We would like to backport a LoadBalancer change to fix Wikidata dumps (https://phabricator.wikimedia.org/T228104)
[19:15:37] <wikibugs>	 Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[19:16:22] <wikibugs>	 (PS3) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945
[19:18:34] <greg-g>	 hoo: ok, swat or whenever ready, note: wmf.14 is only on group0 right now
[19:19:36] <hoo>	 greg-g: Why (and until when) is Wikidata on group0?
[19:19:57] <greg-g>	 until at least tomorrow
[19:20:02] <greg-g>	 https://phabricator.wikimedia.org/T220739
[19:21:51] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523945 (owner: Cwhite)
[19:24:46] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Bstorm) This seems like a bad idea.  Scratch is writable by all of cloud.  I do not want that m...
[19:25:06] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Bstorm) We cross mount dumps NFS I believe to stats hosts (which might be production-ish), but...
[19:25:13] <wikibugs>	 Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (Jdforrester-WMF) We're seeing this happening now on contint...
[19:27:17] <wikibugs>	 (PS1) Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523991
[19:27:19] <wikibugs>	 (PS1) Dzahn: static-rt: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523992
[19:27:21] <wikibugs>	 (PS1) Dzahn: tendril: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523993
[19:27:23] <wikibugs>	 (PS1) Dzahn: librenms: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523994
[19:27:25] <wikibugs>	 (PS1) Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523995
[19:28:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523991 (owner: Dzahn)
[19:28:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] static-rt: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523992 (owner: Dzahn)
[19:28:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] tendril: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523993 (owner: Dzahn)
[19:29:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] librenms: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523994 (owner: Dzahn)
[19:29:30] <wikibugs>	 (PS2) Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523991
[19:29:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523995 (owner: Dzahn)
[19:31:31] <James_F>	 apergos, hoo: I've been poking at it on mwdebug1002 and it doesn't seem immediately and obviously broken, but…
[19:31:49] <apergos>	 oh, you've already scapped it out there?
[19:32:07] <James_F>	 Only onto mwdebug1002, not all of prod.
[19:32:12] <hoo>	 James_F: https://phabricator.wikimedia.org/T228104#5334937
[19:32:16] <apergos>	 yes, mwdebug, exactly
[19:32:19] <hoo>	 You (or I) can try that to verify
[19:32:20] <hoo>	 if you want
[19:32:47] <James_F>	 wikidata will be running group1 == wmf.13 code, so it won't test that.
[19:32:47] <wikibugs>	 (PS8) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[19:33:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[19:33:29] <apergos>	 we need something in group0, uh...
[19:33:30] <hoo>	 James_F: You can also test it with whatever wiki you like
[19:33:35] <hoo>	 testwiki or so
[19:33:40] <apergos>	 mediawikiwiki
[19:33:44] <apergos>	 yeah testwiki is fine too
[19:33:56] <James_F>	 It doesn't fatal.
[19:34:08] <apergos>	 👍
[19:34:12] <James_F>	 I'm more worried about random other crap that dies.
[19:34:15] <hoo>	 That's how it's supposed to be
[19:34:24] <James_F>	 I generally trust coders and reviewers to test the bug they're fixing.
[19:34:45] <James_F>	 I worry about the watchlist suddenly being blank, or editing a page causing a cache stampede, or… ;-D
[19:34:50] <hoo>	 Yeah, backporting LB changes is not exactly nice
[19:35:16] <James_F>	 Eh, it's only group0.
[19:35:23] <apergos>	 "only" :-D
[19:35:33] <hoo>	 Maybe we should go to wmf14 first and wait for 1 or 2 hours?
[19:35:39] <James_F>	 If MW.org breaks I'll notice sharp-ish.
[19:35:56] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer: T228104 rdbms: better handle a non-existing  defaultGroup in LoadBalancer (duration: 00m 55s)
[19:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:05] <stashbot>	 T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104
[19:36:08] <apergos>	 I can be here for 1-2 hours but after that I will be a pumpkin (it's already 10:30 pm)
[19:36:13] <wikibugs>	 (PS9) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[19:36:18] <James_F>	 Yeah, I can push it later today if you want.
[19:36:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[19:36:44] <hoo>	 Yeah, let's wait a bit and see whether the wmf14 part of the world ends
[19:38:52] <James_F>	 For wmf.13 I'm going to need to fiddle to cherry-pick, fun.
[19:39:06] <hoo>	 Doesn't it apply cleanly?
[19:39:19] <hoo>	 Oh, I suppose the tests might clash
[19:39:22] <wikibugs>	 Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (thcipriani) For that particular image I can recreate locall...
[19:40:07] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5342835, @Bstorm wrote: > This seems like a bad idea.  Scratch is writ...
[19:40:27] <wikibugs>	 (PS10) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[19:41:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[19:41:37] <apergos>	 ugh, sorry about that
[19:41:59] <James_F>	 It's fine. :-)
[19:42:25] <James_F>	 Just that rdbms is one of the few areas I have marked out in DANGER! tape in my mind. :-)
[19:45:45] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Andrew) Can you explain in greater detail what problem you're trying to fix?  I suspect that hi...
[19:45:54] <apergos>	 yeah, me too
[19:51:47] <wikibugs>	 (CR) Dzahn: "Thanks Jcrespo. I think the best way forward is that we just say what you said here, not used in production.  The reason i want to add _an" [puppet] - https://gerrit.wikimedia.org/r/521382 (owner: Dzahn)
[19:53:47] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5342916, @Andrew wrote: > Can you explain in greater detail what probl...
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, and halfak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T2000).
[20:00:34] <apergos>	 James_F: it did go out to wmf14 everywhere, right?  I don't see any scap/log anything in here
[20:10:14] <logmsgbot>	 !log accraze@deploy1001 Started deploy [ores/deploy@676f7ba]: T228331
[20:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:21] <stashbot>	 T228331: Build revert model for glwiki - https://phabricator.wikimedia.org/T228331
[20:15:58] <wikibugs>	 (CR) Dzahn: [C: +2] proxysql: add icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/521382 (owner: Dzahn)
[20:16:06] <wikibugs>	 (PS3) Dzahn: proxysql: add icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/521382
[20:23:21] <wikibugs>	 Operations, ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (Papaul) p:High→Normal
[20:26:09] <wikibugs>	 (CR) SBassett: [C: +1] "Giving this a soft +1 on behalf of the WMF Security Team with the recommendation to review Daimona's suggesting about find_in_set above an" [puppet] - https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: Meshvogel)
[20:28:07] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Bstorm) I really must decline this request if that's the reason.  My thinking on this is:  1. T...
[20:30:11] <James_F>	 apergos: It did.
[20:30:33] <apergos>	 ok!  I have been scrying logstash just in case 
[20:30:55] <James_F>	 apergos: https://tools.wmflabs.org/sal/log/AWwBbyRrOwpQ-3PkId88
[20:31:17] <apergos>	 but not in here. hmmm....bad bots get beaten!
[20:31:45] <apergos>	 oh. I see it in here now. apparently my reading abilities have taken a nosedive
[20:31:51] <wikibugs>	 (PS3) Ottomata: Refine mediawiki_revision_create events using schema aware Refine job [puppet] - https://gerrit.wikimedia.org/r/523791 (https://phabricator.wikimedia.org/T211248)
[20:31:55] <apergos>	 sorry for the noise!
[20:32:54] <wikibugs>	 (CR) Ottomata: [C: +2] Refine mediawiki_revision_create events using schema aware Refine job [puppet] - https://gerrit.wikimedia.org/r/523791 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata)
[20:33:01] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (bd808) Open→Declined We can not mount filesystems from the Cloud Services network realm...
[20:33:13] <wikibugs>	 (PS4) Dzahn: proxysql: add icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/521382
[20:35:12] <logmsgbot>	 !log accraze@deploy1001 Finished deploy [ores/deploy@676f7ba]: T228331 (duration: 24m 59s)
[20:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:20] <stashbot>	 T228331: Build revert model for glwiki - https://phabricator.wikimedia.org/T228331
[20:37:20] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5343177, @bd808 wrote: > We can not mount filesystems from the Cloud S...
[20:39:54] <wikibugs>	 Operations, MediaWiki-Logging, Wikimedia-Logstash, serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (tstarling) Is this blocking deployment of PHP 7?
[20:43:25] <apergos>	 it's been an hour plus, and so far: no phab reports, no comments on mediawikiwiki itself (I'm stalking rc there), and nothing weird that I saw at any rate, in logstash
[20:43:36] <apergos>	 so, looking good so far hope-I-don't-jinx-it
[20:44:18] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Bstorm) >>! In T153068#5343192, @Urbanecm wrote: > That's in contrary with what @Bstorm said, b...
[20:45:42] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (Urbanecm) >>! In T153068#5343230, @Bstorm wrote: >>>! In T153068#5343192, @Urbanecm wrote: >> T...
[20:46:35] <wikibugs>	 Operations, ops-codfw: (OoW) wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (Papaul) No memory errors showing on this system in the log . Upgrade IDRAC from 1.5 to 2.6 . We have a new BIOS  version available we need to depool the server for the upgrade
[20:51:22] <wikibugs>	 (PS1) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024
[20:54:44] <wikibugs>	 Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway)
[20:57:00] <wikibugs>	 (CR) Eevans: "I am following https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm for the first time, and assuming that the deployment-charts re" [deployment-charts] - https://gerrit.wikimedia.org/r/524024 (owner: Eevans)
[21:00:01] <wikibugs>	 (PS1) 20after4: Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670)
[21:00:14] <James_F>	 apergos, hoo: OK, things seem fine. I'll push it to wmf.13 too.
[21:00:34] <apergos>	 okey dokey, yeah they still look good from here
[21:00:52] <logmsgbot>	 !log nuria@deploy1001 Started deploy [analytics/refinery@4f07755]: refinery 0.0.94
[21:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:11] <wikibugs>	 (CR) Aklapper: [C: +1] Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4)
[21:05:51] <wikibugs>	 (CR) 20after4: [C: +1] Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4)
[21:07:35] <logmsgbot>	 !log otto@deploy1001 Started deploy [eventstreams/deploy@dbc9bbb]: Fix ?doc to use openapi instead of swagger - T227958
[21:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:43] <stashbot>	 T227958: stream.wikimedia.org/?doc returns an error page - https://phabricator.wikimedia.org/T227958
[21:10:04] <wikibugs>	 Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (Andrew) Open→Resolved looks good -- thanks @colewhite
[21:10:16] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:10:27] <logmsgbot>	 !log otto@deploy1001 Finished deploy [eventstreams/deploy@dbc9bbb]: Fix ?doc to use openapi instead of swagger - T227958 (duration: 02m 52s)
[21:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:57] <wikibugs>	 (CR) Paladox: [C: +1] Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4)
[21:11:29] <wikibugs>	 (PS4) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945
[21:15:08] <wikibugs>	 (PS7) Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066)
[21:15:47] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Flow: Clean up accidentally-deployed debugging code for T228290 (duration: 01m 02s)
[21:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:53] <stashbot>	 T228290: Fatal on Watchlist: Nesting level too deep - https://phabricator.wikimedia.org/T228290
[21:16:42] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:16:52] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.13/includes/libs/rdbms/loadbalancer: T228104 rdbms: better handle a non-existing  defaultGroup in LoadBalancer (duration: 00m 55s)
[21:16:52] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:14] <stashbot>	 T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104
[21:17:19] <wikibugs>	 (PS8) Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066)
[21:17:52] <wikibugs>	 (Abandoned) Cwhite: proemtheus: testing centralize stats config (do not deploy) [puppet] - https://gerrit.wikimedia.org/r/523945 (owner: Cwhite)
[21:18:22] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:18:34] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:20:18] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:25:42] <wikibugs>	 (PS1) Krinkle: mediawiki: Fix undefined 'err' and 'message' in php7-fatal-error [puppet] - https://gerrit.wikimedia.org/r/524036 (https://phabricator.wikimedia.org/T228345)
[21:27:24] <wikibugs>	 (PS1) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[21:31:32] <wikibugs>	 (PS1) Bstorm: toolforge: Enable pod security policy [puppet] - https://gerrit.wikimedia.org/r/524038 (https://phabricator.wikimedia.org/T227290)
[21:32:57] <wikibugs>	 Puppet, cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (Andrew) Thinking about this a bit today, I'm no longer sure that the two puppet catalogs need to be disjoint.  If...
[21:34:51] <wikibugs>	 Operations, serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Fito)
[21:36:14] <wikibugs>	 (PS2) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[21:37:15] <logmsgbot>	 !log nuria@deploy1001 Finished deploy [analytics/refinery@4f07755]: refinery 0.0.94 (duration: 36m 28s)
[21:37:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:34] <nuria>	 !log deployment aborted for refinery 0.0.94
[21:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:05] <wikibugs>	 (PS1) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043
[21:40:56] <wikibugs>	 (PS2) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043
[21:42:03] <wikibugs>	 (PS3) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043
[21:42:24] <apergos>	 !log started wikidata entity dumps json run on snapshot1008
[21:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043 (owner: Dzahn)
[21:43:39] <wikibugs>	 (PS4) Dzahn: nrpe: remove unit tests [puppet] - https://gerrit.wikimedia.org/r/524043
[21:44:53] <wikibugs>	 (PS3) Dzahn: microsites/transparency: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523991
[21:45:35] <wikibugs>	 Operations, MediaWiki-Uploading, Multimedia, media-storage, Wikimedia-production-error: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (greg) Adding #operations per #media-storage / @fgiunchedi...
[21:45:54] <wikibugs>	 (PS2) Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523995
[21:46:06] <wikibugs>	 (PS2) Dzahn: librenms: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523994
[21:46:22] <wikibugs>	 (PS2) Dzahn: tendril: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523993
[21:46:33] <wikibugs>	 (PS2) Dzahn: static-rt: use ldap-ro, stop using ldap-labs [puppet] - https://gerrit.wikimedia.org/r/523992
[21:47:42] <wikibugs>	 (CR) Dzahn: nrpe: add notes_url parameter to spec and tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/521386 (owner: Dzahn)
[21:47:56] <icinga-wm>	 RECOVERY - MegaRAID on es2003 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:50:15] <wikibugs>	 (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17449/" [puppet] - https://gerrit.wikimedia.org/r/524037 (owner: Ayounsi)
[21:51:51] <wikibugs>	 (PS3) Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377
[21:52:59] <wikibugs>	 (PS4) Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377
[21:56:02] <wikibugs>	 (PS1) Ayounsi: Reserve IP for syslog anycast [dns] - https://gerrit.wikimedia.org/r/524045
[21:57:25] <wikibugs>	 (CR) Dzahn: [C: +2] Lock the phabricator authentication provider config options. [puppet] - https://gerrit.wikimedia.org/r/524026 (https://phabricator.wikimedia.org/T220670) (owner: 20after4)
[21:59:06] <wikibugs>	 (PS2) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024
[22:01:11] <wikibugs>	 (PS3) Eevans: sessionstore staging - update to v1.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/524024
[22:01:44] <wikibugs>	 (PS3) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[22:03:12] <wikibugs>	 (PS4) Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - https://gerrit.wikimedia.org/r/524037
[22:04:04] <wikibugs>	 (CR) Ayounsi: [C: +1] wmnet: flip syslog.eqiad.wmnet to centrallog1001 [dns] - https://gerrit.wikimedia.org/r/523957 (https://phabricator.wikimedia.org/T200706) (owner: Filippo Giunchedi)
[22:16:21] <hoo>	 !log Manually started the Wikidata RDF dumps on snapshot1008 (due to T228104)
[22:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:29] <stashbot>	 T228104: Wikibase dump scripts fail on external storage access - https://phabricator.wikimedia.org/T228104
[22:33:24] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:33:48] <wikibugs>	 Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (Dzahn) a:MoritzMuehlenhoff→Dzahn
[22:35:09] <mutante>	 !log reimaging mw2250 after disks have been replaced
[22:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:46] <wikibugs>	 (CR) Dzahn: [C: +2] postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377 (owner: Dzahn)
[22:36:55] <wikibugs>	 (PS5) Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - https://gerrit.wikimedia.org/r/521377
[22:38:46] <icinga-wm>	 RECOVERY - Host mw2250 is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms
[22:39:40] <wikibugs>	 Operations, Developer-Advocacy, Discourse, Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (elappen-WMF)
[22:41:15] <wikibugs>	 (PS11) Dzahn: nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873)
[22:42:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[22:42:39] <wikibugs>	 (CR) Ppchelko: Add change-prop event_service_uri and point at eventgate-main (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: Ottomata)
[22:45:02] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:55:56] <wikibugs>	 Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190717T2300).
[23:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:06:49] <wikibugs>	 (CR) Ppchelko: [C: +1] Add change-prop event_service_uri and point at eventgate-main (1 comment) [puppet] - https://gerrit.wikimedia.org/r/523792 (https://phabricator.wikimedia.org/T226522) (owner: Ottomata)
[23:14:44] <wikibugs>	 (PS1) Ppchelko: Switch RESTBase event production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T226522)
[23:18:48] <wikibugs>	 (PS2) Ppchelko: Switch RESTBase event production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T226522)
[23:19:13] <wikibugs>	 Operations, Machine vision, serviceops, Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway) @Joe I've updated the fork at https://github.com/mdholloway/nsfwoid according to your...
[23:29:02] <wikibugs>	 (PS1) Catrope: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084)
[23:34:57] <wikibugs>	 (PS1) Ppchelko: Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T226522)
[23:35:52] <wikibugs>	 (PS2) Ppchelko: [RESTRouter] Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T226522)
[23:37:42] <wikibugs>	 (CR) Catrope: [C: +2] Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope)
[23:38:46] <wikibugs>	 (Merged) jenkins-bot: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope)
[23:39:01] <wikibugs>	 (CR) jenkins-bot: Deploy TheWikipediaLibrary to beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/524059 (https://phabricator.wikimedia.org/T132084) (owner: Catrope)
[23:40:36] <wikibugs>	 (PS3) Ppchelko: [RESTRouter] Switch event service to eventgate. [deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T524055)
[23:41:02] <wikibugs>	 (PS3) Ppchelko: Switch RESTBase event production to eventgate. Step 1. [puppet] - https://gerrit.wikimedia.org/r/524057 (https://phabricator.wikimedia.org/T524055)
[23:48:05] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wmgUseTheWikipediaLibrary (false everywhere, no-op) (duration: 00m 53s)
[23:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:21] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Add wmgUseTheWikipediaLibrary (false everywhere, no-op) (duration: 00m 54s)
[23:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:46] <wikibugs>	 (PS1) Catrope: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061
[23:57:55] <wikibugs>	 (CR) Catrope: [C: +2] beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 (owner: Catrope)
[23:58:53] <wikibugs>	 (Merged) jenkins-bot: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 (owner: Catrope)
[23:59:14] <wikibugs>	 (CR) jenkins-bot: beta: Set $wgTwlEditCount to 100 for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/524061 (owner: Catrope)