[00:00:06] <icinga-wm>	 PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.200 second response time
[00:01:06] <icinga-wm>	 RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.528 second response time
[00:02:16] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[00:06:14] <James_F>	 Anyone SWATing?
[00:07:42] <greg-g>	 where's jouncebot?
[00:07:54] <James_F>	 Died in the NFS stuff maybe?
[00:12:31] <jdlrobson>	 :(
[00:12:48] <jdlrobson>	 greg-g: James_F who is on swat duty?
[00:13:33] <greg-g>	 addshore, Antoine (hashar), Brad (anomie), Chad (ostriches), Katie (aude), Max (MaxSem), Mukunda (twentyafterfour), Roan (RoanKattouw), Sébastien (Dereckson), or Tyler (thcipriani)
[00:13:51] <ostriches>	 You're not jouncebot!
[00:14:12] <greg-g>	 :)
[00:14:25] * greg-g waits for the "you can't tell me what to do" line from ostriches 
[00:15:14] <addshore>	 https://cdn.meme.am/cache/instances/folder225/500x/62235225.jpg
[00:15:29] * James_F grins.
[00:15:38] <thcipriani>	 I can SWAT
[00:15:44] <jdlrobson>	 W00TTTT
[00:16:00] <James_F>	 Thanks thcipriani.
[00:16:46] <jdlrobson>	 http://i.giphy.com/Clrnitk7xtHiM.gif
[00:16:54] <wikibugs>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/333124 (https://phabricator.wikimedia.org/T152743) (owner: Jdlrobson)
[00:18:49] <thcipriani>	 zuul is looking a bit overworked
[00:19:45] <wikibugs>	 (Merged) jenkins-bot: Wikidata description taglines shown on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/333124 (https://phabricator.wikimedia.org/T152743) (owner: Jdlrobson)
[00:19:56] <wikibugs>	 (CR) jenkins-bot: Wikidata description taglines shown on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/333124 (https://phabricator.wikimedia.org/T152743) (owner: Jdlrobson)
[00:20:36] <thcipriani>	 jdlrobson: your change is live on mwdebug1002, check please
[00:22:09] <volans>	 !log apt-upgrading nodejs to v6 on the rest of parsoid hosts (a deploy with restart will follow) T149331
[00:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:14] <stashbot>	 T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331
[00:22:15] <jdlrobson>	 yay thcipriani it works!
[00:22:23] <thcipriani>	 jdlrobson: ok, going live everywhere
[00:22:26] <jdlrobson>	 w00t
[00:24:19] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:333124|Wikidata description taglines shown on English Wikipedia]] T152743 (duration: 00m 39s)
[00:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:23] <stashbot>	 T152743: Deploy wikidata descriptions to mobile English Wikipedia stable - https://phabricator.wikimedia.org/T152743
[00:24:25] <thcipriani>	 ^ jdlrobson live everywhere
[00:25:30] <jdlrobson>	 thcipriani: confirmed working! THANKS!
[00:25:41] <thcipriani>	 cool, thanks for checking :)
[00:25:43] <James_F>	 Welcome back, jouncebot.
[00:25:45] <madhuvishy>	 i brought jouncebot back
[00:26:13] <James_F>	 Thanks madhuvishy.
[00:26:27] <thcipriani>	 jouncebot: now
[00:26:27] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170120T0000)
[00:26:42] <madhuvishy>	 np :)
[00:26:53] <thcipriani>	 jouncebot is my spirit animal
[00:29:59] <ostriches>	 !bash "jouncebot is my spirit animal" -- thcipriani
[00:29:59] <stashbot>	 ostriches: Stored quip at https://tools.wmflabs.org/bash/quip/AVm5R60YlCyyDMEPvDOX
[00:30:33] <thcipriani>	 James_F: oojsui fixes should be live on mwdebug1002, check please
[00:31:00] <p858snake|>	 back in the day when we could store quips in bugzilla >.> <.<
[00:31:45] <James_F>	 thcipriani: On it.
[00:32:25] <James_F>	 thcipriani: Yup!
[00:32:34] <thcipriani>	 James_F: okie doke, going live
[00:34:51] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.8/resources/lib/oojs-ui: SWAT: [[gerrit:333100|resources: Update OOjs UI with fixes on top of v0.18.3]] T155728 (duration: 00m 41s)
[00:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:55] <stashbot>	 T155728: Exception thrown when trying to open most dialogs in VE mobile - https://phabricator.wikimedia.org/T155728
[00:34:56] <thcipriani>	 ^ James_F live
[00:35:02] <James_F>	 Thanks!
[00:35:15] <James_F>	 Now just the Special:Contributions one.
[00:35:45] <logmsgbot>	 !log mobrovac@tin Starting deploy [parsoid/deploy@465f9c4]: Restarting Parsoid everywhere for Node v6 switch T149331
[00:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:49] <stashbot>	 T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331
[00:35:54] <mobrovac>	 volans: ^^^
[00:36:05] <volans>	 mobrovac: yep
[00:36:06] <thcipriani>	 James_F: yup, that is now live on mwdebug1002
[00:37:00] <wikibugs>	 Operations, Labs, Patch-For-Review, Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2954890 (Ocaasi_WMF) Apparently we were missed on the list and therefore not rebooted, so it's being recovered now.  Should fix it most likely.  Thanks!  -Jake
[00:38:27] <James_F>	 thcipriani: Yup, works.
[00:38:34] <thcipriani>	 k going live
[00:39:29] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.8/includes/specials/SpecialContributions.php: SWAT: [[gerrit:333127|SpecialContributions: Username input is not really required]] T155780 (duration: 00m 39s)
[00:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:33] <stashbot>	 T155780: Special:Contributions requires a username even when using "Show contributions of new accounts only" - https://phabricator.wikimedia.org/T155780
[00:39:36] <thcipriani>	 ^ James_F live everywhere
[00:40:05] <logmsgbot>	 !log mobrovac@tin Finished deploy [parsoid/deploy@465f9c4]: Restarting Parsoid everywhere for Node v6 switch T149331 (duration: 04m 21s)
[00:40:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:05] <James_F>	 Thank you, thcipriani.
[00:46:55] <mutante>	 !log setting all deployment key passphrases to the one used for mw deploy - update key files in private repo (T154943)
[00:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:59] <stashbot>	 T154943: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943
[00:48:16] <icinga-wm>	 PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[00:48:46] <icinga-wm>	 RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[00:49:58] <mutante>	 !log mira - arming keyholder after setting service/dumps/eventlogging/phabricator key passphrases to the same one (T154943)
[00:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:26] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.942 second response time
[00:52:54] <mutante>	 !log tin - keyholder disarm and arm again using new passphrase
[00:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:46] <icinga-wm>	 PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[00:54:56] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:56:34] <mutante>	 oh come on, it's totally armed per "status"
[00:59:56] <volans>	 mutante: the paths are missing
[01:00:37] <volans>	 no sorry
[01:00:41] <mutante>	 volans: you mean in the output of "keyholder status"? that's just the key comments
[01:00:48] <mutante>	 which are not set
[01:00:59] <volans>	 mutante: I thought it was that, but actually it's the ssh-add -l that is empty
[01:01:27] <volans>	 ok got it
[01:01:51] <volans>	 the command in /usr/lib/nagios/plugins/check_keyholder applied to the output of ssh-add is returning just rsa
[01:02:02] <volans>	 because that should be the path
[01:02:23] <volans>	 and it doesn't match with the sorted paths in configured_keys() that are taken from a find
[01:02:42] <volans>	 mutante: ^^^
[01:05:06] <icinga-wm>	 PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[01:05:26] <icinga-wm>	 RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[01:05:46] <volans>	 chasemp, madhuvishy ^^^ expected?
[01:05:56] <madhuvishy>	 volans: man
[01:06:00] <madhuvishy>	 something's up with icinga
[01:06:23] <mutante>	 volans: hmm, i dont get why analytics would be the only one different
[01:06:34] <madhuvishy>	 the host has been up
[01:06:40] <madhuvishy>	 and looks okay
[01:06:50] <mutante>	 thanks, so i guess i removed the key comment by just not providing it
[01:06:51] <madhuvishy>	 this is the second time today it's erroneously alerting
[01:07:13] <volans>	 mutante: yeah, seems so
[01:07:43] <mutante>	 but weird how one key kept it
[01:07:52] <mutante>	 fixing
[01:07:55] <volans>	 ok
[01:08:08] <bd808>	 madhuvishy: hmmm... sporadic packet loss for some unknown reason?
[01:08:22] <madhuvishy>	 probably
[01:08:52] <bd808>	 bad nic, bad switch port, bad cable, arp madness... so many possibilities
[01:09:15] <volans>	 madhuvishy: https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=labstore1004
[01:09:54] <madhuvishy>	 crazy 
[01:10:02] <madhuvishy>	 yeah might even explain outage
[01:10:57] <madhuvishy>	 volans: this node isn't serving nfs right now - so i'm going to keep icinga silenced there for a couple hours, and investigate when I'm back
[01:11:24] <volans>	 ok
[01:11:42] <madhuvishy>	 volans: thanks for pointing it out!
[01:11:54] <volans>	 yw :)
[01:16:36] <icinga-wm>	 PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:22:56] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[01:25:05] <mutante>	 volans: "Comments are only supported for RSA1 keys."
[01:25:06] <mutante>	 :p
[01:25:30] <mutante>	 so even with -C / -c not adding the comment silently
[01:26:08] <volans>	 madhuvishy: FYI I've tried a ping from a random host (wtp1008): 1078 packets transmitted, 1047 received, 2% packet loss 
[01:26:37] <volans>	 they should be 0
[01:27:52] <volans>	 mutante: how did they have the comment before?
[01:28:02] <volans>	 maybe rsa1 is the only one that allows changing it?
[01:29:53] <mutante>	 volans: yes, that's it.  changing it only for rsa1
[01:30:00] <mutante>	 but removing it is apparently easy, meh
[01:30:18] <mutante>	  This operation is only supported for RSA1 keys. 
[01:31:04] <volans>	 adding them manually in the public key is not enough I guess... :-P
[01:40:16] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:44:36] <icinga-wm>	 RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[01:58:13] <wikibugs>	 (PS2) Ema: icinga: critical on ripe atlas check exceptions [puppet] - https://gerrit.wikimedia.org/r/333093
[01:58:15] <wikibugs>	 (PS1) Ema: Text VCL: consolidate mobile hostname rewrite regex [puppet] - https://gerrit.wikimedia.org/r/333157 (https://phabricator.wikimedia.org/T155504)
[02:02:35] <ema>	 grr, pushed from the wrong branch
[02:02:57] <wikibugs>	 (Abandoned) Ema: Text VCL: consolidate mobile hostname rewrite regex [puppet] - https://gerrit.wikimedia.org/r/333157 (https://phabricator.wikimedia.org/T155504) (owner: Ema)
[02:03:33] <wikibugs>	 (PS1) Ema: Text VCL: consolidate mobile hostname rewrite regex [puppet] - https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504)
[02:07:16] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:29:56] <icinga-wm>	 RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[02:30:57] <volans>	 mutante: \o/
[02:31:11] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.8) (duration: 11m 23s)
[02:31:14] <mutante>	 yea, but it's only a revert :/
[02:31:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:25] <mutante>	 i talked in #openssh for a while even
[02:32:05] <mutante>	 they say old ssh-add version .. but .. eh
[02:32:23] <mutante>	 and that's not the comment line, that's just the full path to the key
[02:32:55] <mutante>	 https://paste.pound-python.org/show/UOsCwU9cRuEPVdLKcLxm/  etc
[02:32:59] <volans>	 that disappears when changing the password?
[02:33:02] <mutante>	 yes
[02:33:07] <mutante>	 except for one of them
[02:33:10] <volans>	 WAT?!?!
[02:33:33] <mutante>	 18:18 < BasketCase> I am not sure why manual ssh-add doesn't add a comment
[02:33:43] <mutante>	 18:19 < BasketCase> also, manual ssh-add actually added the comment from the pub file for my ed25519 key but not my rsa key
[02:34:05] <mutante>	 18:22 < BasketCase> I am on 7.3 btw
[02:34:05] <mutante>	 18:22 < BasketCase> not whatever obsolete junk Debian has
[02:34:14] <volans>	 lol
[02:34:19] <volans>	 great :(
[02:34:31] <mutante>	 yea, eh, i am reverting to get back to it later
[02:34:45] <mutante>	 need a break for now
[02:34:57] <mutante>	 totally unexpected rabbit hole here
[02:34:59] <volans>	 ok, otherwise we could just create new ones
[02:35:04] <mutante>	 yes, true
[02:35:38] <volans>	 probably easier and also safer, as an unscheduled key rotation :)
[02:36:45] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 20 02:36:44 UTC 2017 (duration 5m 34s)
[02:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:46] <icinga-wm>	 RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys.
[02:41:02] <wikibugs>	 (PS1) Volans: /home: update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/333160
[02:42:30] <wikibugs>	 (CR) Volans: [C: 2] /home: update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/333160 (owner: Volans)
[02:50:16] <icinga-wm>	 PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:16] <icinga-wm>	 RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:34:55] <wikibugs>	 (CR) Krinkle: [C: 1] Remove extra layer of symlink indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/323999 (owner: Chad)
[03:37:06] <icinga-wm>	 PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.216 second response time
[03:38:06] <icinga-wm>	 RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.423 second response time
[03:40:33] <wikibugs>	 (PS1) Volans: Puppetmaster: remove temporary logging for debugging [puppet] - https://gerrit.wikimedia.org/r/333162 (https://phabricator.wikimedia.org/T128895)
[03:43:06] <icinga-wm>	 PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.189 second response time
[03:45:00] <wikibugs>	 Operations, Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2955175 (Volans) @akosiaris it's probably time to remove this patch that I made a year ago, since it's not happening anymore. Thoughts?
[03:45:06] <icinga-wm>	 RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.725 second response time
[04:23:46] <wikibugs>	 Operations, Traffic, HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2955191 (faidon)
[04:41:07] <wikibugs>	 Operations, Traffic, HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2955204 (faidon)
[04:56:14] <wikibugs>	 Operations, Traffic, HTTPS: Monitor Certificate Transparency (CT) logs - https://phabricator.wikimedia.org/T155807#2955212 (faidon)
[04:59:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:00:16] <icinga-wm>	 PROBLEM - DPKG on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:00:26] <icinga-wm>	 PROBLEM - Disk space on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:00:46] <icinga-wm>	 PROBLEM - MD RAID on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:01:26] <icinga-wm>	 PROBLEM - configured eth on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:01:36] <icinga-wm>	 PROBLEM - dhclient process on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:01:56] <icinga-wm>	 PROBLEM - puppet last run on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:02:06] <icinga-wm>	 PROBLEM - salt-minion processes on mw2253 is CRITICAL: Return code of 255 is out of bounds
[05:04:56] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[05:09:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2253 is OK: OK - running: The system is fully operational
[05:10:06] <icinga-wm>	 RECOVERY - salt-minion processes on mw2253 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:10:16] <icinga-wm>	 RECOVERY - DPKG on mw2253 is OK: All packages OK
[05:10:26] <icinga-wm>	 RECOVERY - configured eth on mw2253 is OK: OK - interfaces up
[05:10:26] <icinga-wm>	 RECOVERY - Disk space on mw2253 is OK: DISK OK
[05:10:36] <icinga-wm>	 RECOVERY - dhclient process on mw2253 is OK: PROCS OK: 0 processes with command name dhclient
[05:10:46] <icinga-wm>	 RECOVERY - MD RAID on mw2253 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:11:56] <icinga-wm>	 RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[05:29:13] <icinga-wm>	 PROBLEM - MD RAID on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:29:13] <icinga-wm>	 PROBLEM - salt-minion processes on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:29:53] <icinga-wm>	 PROBLEM - Check systemd state on mw2258 is CRITICAL: Return code of 255 is out of bounds
[05:29:53] <icinga-wm>	 PROBLEM - configured eth on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:30:13] <icinga-wm>	 PROBLEM - dhclient process on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:30:13] <icinga-wm>	 PROBLEM - DPKG on mw2258 is CRITICAL: Return code of 255 is out of bounds
[05:30:23] <icinga-wm>	 PROBLEM - Disk space on mw2258 is CRITICAL: Return code of 255 is out of bounds
[05:30:23] <icinga-wm>	 PROBLEM - puppet last run on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:30:33] <icinga-wm>	 PROBLEM - MD RAID on mw2258 is CRITICAL: Return code of 255 is out of bounds
[05:30:34] <icinga-wm>	 PROBLEM - salt-minion processes on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:30:53] <icinga-wm>	 PROBLEM - Check systemd state on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:31:03] <icinga-wm>	 PROBLEM - DPKG on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:31:23] <icinga-wm>	 PROBLEM - Disk space on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:31:23] <icinga-wm>	 PROBLEM - configured eth on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:31:23] <icinga-wm>	 RECOVERY - Disk space on mw2258 is OK: DISK OK
[05:31:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[05:31:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[05:31:33] <icinga-wm>	 PROBLEM - MD RAID on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:31:33] <icinga-wm>	 RECOVERY - MD RAID on mw2258 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:31:53] <icinga-wm>	 RECOVERY - Check systemd state on mw2258 is OK: OK - running: The system is fully operational
[05:32:13] <icinga-wm>	 RECOVERY - DPKG on mw2258 is OK: All packages OK
[05:32:13] <icinga-wm>	 PROBLEM - configured eth on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:32:13] <icinga-wm>	 PROBLEM - Check systemd state on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:32:23] <icinga-wm>	 RECOVERY - configured eth on mw2258 is OK: OK - interfaces up
[05:32:33] <icinga-wm>	 PROBLEM - DPKG on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:32:33] <icinga-wm>	 PROBLEM - dhclient process on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:32:43] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[05:32:43] <icinga-wm>	 PROBLEM - Disk space on mw2259 is CRITICAL: Return code of 255 is out of bounds
[05:32:43] <icinga-wm>	 PROBLEM - puppet last run on mw2260 is CRITICAL: Return code of 255 is out of bounds
[05:33:33] <icinga-wm>	 RECOVERY - dhclient process on mw2260 is OK: PROCS OK: 0 processes with command name dhclient
[05:33:33] <icinga-wm>	 RECOVERY - MD RAID on mw2260 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:33:33] <icinga-wm>	 RECOVERY - salt-minion processes on mw2259 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:33:43] <icinga-wm>	 RECOVERY - Disk space on mw2259 is OK: DISK OK
[05:33:53] <icinga-wm>	 RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational
[05:33:53] <icinga-wm>	 RECOVERY - configured eth on mw2259 is OK: OK - interfaces up
[05:34:03] <icinga-wm>	 RECOVERY - DPKG on mw2260 is OK: All packages OK
[05:34:13] <icinga-wm>	 RECOVERY - dhclient process on mw2259 is OK: PROCS OK: 0 processes with command name dhclient
[05:34:13] <icinga-wm>	 RECOVERY - salt-minion processes on mw2260 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:34:13] <icinga-wm>	 RECOVERY - MD RAID on mw2259 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:34:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational
[05:34:13] <icinga-wm>	 RECOVERY - configured eth on mw2260 is OK: OK - interfaces up
[05:34:23] <icinga-wm>	 RECOVERY - Disk space on mw2260 is OK: DISK OK
[05:34:23] <icinga-wm>	 RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[05:34:33] <icinga-wm>	 RECOVERY - DPKG on mw2259 is OK: All packages OK
[05:34:43] <icinga-wm>	 RECOVERY - puppet last run on mw2260 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[05:37:34] <wikibugs>	 Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2955238 (Papaul)
[05:38:20] <wikibugs>	 Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (Papaul)
[05:38:48] <wikibugs>	 Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (Papaul)
[05:39:28] <wikibugs>	 Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (Papaul) a:Papaul>Joe @joe installation complete.
[05:42:23] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:42:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:16:03] <icinga-wm>	 PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:16:53] <icinga-wm>	 RECOVERY - Host labstore1004 is UP: PING WARNING - Packet loss = 50%, RTA = 0.24 ms
[06:30:53] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:58] <wikibugs>	 Operations: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#2955264 (Dzahn) Reverted to the old keys for the moment. When changing passphrase with ssh-keygen -p -f .. a side-effect was that ssh-add -l does not show full path anymore as default comment, only shows...
[06:33:23] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.319 second response time
[06:38:23] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.596 second response time
[06:42:43] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:45:33] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:46:09] <_joe_>	 this ^^ is a known problem I still have to fix
[06:47:00] <wikibugs>	 Operations: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#2955275 (Dzahn) ``` 17:54 < mutante> !log tin - keyholder disarm and arm again using new passphrase 16:51 < mutante> !log mira - arming keyholder after setting service/dumps/eventlogging/phabricator key p...
[06:52:53] <icinga-wm>	 PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:53:43] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:54:53] <wikibugs>	 Operations, Parsoid, User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2955281 (Joe) @mobrovac when I read the task I was as surprised as you, given I remember we did create those rules correctly (although I think the copytruncate is on purpose).  I'll take a look; rut...
[07:07:21] <wikibugs>	 Operations, DBA: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1579303 (Marostegui) db1035 looks good now  ``` root@db1035:/srv/sqldata# df -hT /srv/ Filesystem            Type  Size  Used Avail Use% Mounted on /dev/mapper/tank-data xfs   1.6T  1.1T  585G  64% /srv ```
[07:09:27] <marostegui>	 !log Compress pagelinks tables on db1015 - T153739
[07:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:32] <stashbot>	 T153739: Defragment db1015 - https://phabricator.wikimedia.org/T153739
[07:15:32] <wikibugs>	 Operations, Labs, kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955290 (yuvipanda) @akosiaris hmm I'd really like to keep the pin in puppet - there's enough uncertainty as is without having to find docker version mis...
[07:19:39] <wikibugs>	 Operations, Parsoid, User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2955291 (Joe) so, mystery solved.  When on systemd, `service::node` uses `systemd::syslog`, which takes care of setting up the rsyslog entries and everything else, including the logrotate rule.  Whi...
[07:20:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: lvs: lower the depool threshold for API [puppet] - https://gerrit.wikimedia.org/r/333166
[07:20:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: systemd::syslog: fix logrotate rules [puppet] - https://gerrit.wikimedia.org/r/333167 (https://phabricator.wikimedia.org/T155768)
[07:21:53] <icinga-wm>	 RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:24:33] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[07:30:07] <wikibugs>	 Operations, Labs, kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2955295 (yuvipanda) If we have only one version it also means we are tying the prod and tools versions together forever, with upgrades needing to happen a...
[07:30:51] <wikibugs>	 Operations, DBA, Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2953992 (Marostegui) Hello,   That is indeed present on all the servers for commonswiki. It is also present at enwiki for instance.  I have checked oth...
[07:35:56] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/332997 (https://phabricator.wikimedia.org/T153300) (owner: Marostegui)
[07:37:03] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/332997 (https://phabricator.wikimedia.org/T153300) (owner: Marostegui)
[07:38:37] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 - T153300 (duration: 00m 48s)
[07:38:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:40] <stashbot>	 T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300
[07:44:20] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/332997 (https://phabricator.wikimedia.org/T153300) (owner: Marostegui)
[07:46:43] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[08:13:55] <wikibugs>	 (CR) Giuseppe Lavagetto: "For the record: this will allow to depool all servers used for async processing (CP/rb/parsoid) and still be able to serve all the traffic" [puppet] - https://gerrit.wikimedia.org/r/333166 (owner: Giuseppe Lavagetto)
[08:14:04] <wikibugs>	 (PS2) Giuseppe Lavagetto: lvs: lower the depool threshold for API [puppet] - https://gerrit.wikimedia.org/r/333166
[08:15:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] lvs: lower the depool threshold for API [puppet] - https://gerrit.wikimedia.org/r/333166 (owner: Giuseppe Lavagetto)
[08:21:21] <wikibugs>	 (PS3) Muehlenhoff: Grant temporary access to labsdb replica from Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/332457 (https://phabricator.wikimedia.org/T155487)
[08:25:25] <_joe_>	 !log restarting pybal on lvs1003/1006 to pick up config changes
[08:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:18] <marostegui>	 !log Remove partitions on metawiki.pagelinks db2047 - T153300
[08:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:22] <stashbot>	 T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300
[08:54:26] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor can't render a few SVGs that Mediawiki can - https://phabricator.wikimedia.org/T150754#2955425 (10Gilles) 05Open>03Resolved
[08:55:29] <_joe_>	 gilles: do you think thumbor will be ready to replace the scalers in the near future?
[08:55:44] <_joe_>	 I wanted to understand if we need it in codfw as well for the switchover or not
[08:59:19] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2955435 (10Ladsgroup) I realized I don't need hue.wikimedia.org access (I thought it's quarry.wmflabs.org for hadoop which I was wrong) but I like the elephan...
[09:10:23] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.281 second response time
[09:11:23] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.424 second response time
[09:19:26] <jynus>	 !log rolling restart and upgrade of labsdb1009/10/11 to mariadb 10.1.21-2
[09:19:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:53] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[09:23:43] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:23:43] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[09:29:48] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Split TMH transcode queue into two for prioritization [puppet] - 10https://gerrit.wikimedia.org/r/331668 (https://phabricator.wikimedia.org/T155098) (owner: 10Brion VIBBER)
[09:31:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM, but I'd wait until we want to actually bring the feature in production before merging this, or we'd reduce videoscalers capacity sig" [puppet] - 10https://gerrit.wikimedia.org/r/331668 (https://phabricator.wikimedia.org/T155098) (owner: 10Brion VIBBER)
[09:37:28] <chasemp>	 ^ yuvi I think maintain-dbusers is failing consistently and I imagine it's due to jynus's upgrade going on for 1009/10/11
[09:37:49] <yuvipanda>	 chasemp: ok, I'll look now
[09:37:54] <yuvipanda>	 it could also be just failing from the failover
[09:38:04] <yuvipanda>	 chasemp: 1005 is still primary right
[09:38:07] <chasemp>	 I mean, maybe but it's been fine until now
[09:38:07] <chasemp>	 yeah
[09:38:29] <yuvipanda>	 onalError: (2003, "Can't connect to MySQL server on 'labsdb1011.eqiad.wmnet' ([Errno 111] Connection refused)")
[09:38:35] <yuvipanda>	 maybe firewall
[09:38:43] <yuvipanda>	 chasemp: yeah, it wouldn't fail until someone actually tries to create a tool 
[09:38:43] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[09:39:03] <yuvipanda>	 it failed again
[09:39:05] <chasemp>	 yuvipanda: see jynus upgrading mysql there 
[09:39:11] <chasemp>	 per log I believe
[09:39:15] <jynus>	 yes
[09:39:24] <yuvipanda>	 oh, I didn't catch that
[09:39:29] <yuvipanda>	 ok, sorry, that'd make sense
[09:39:35] <yuvipanda>	 I'll keep a watch, and ack the check
[09:39:42] <jynus>	 it is ok
[09:39:46] <chasemp>	 I wonder if sometime it wouldn't make sense to disable new tool creation during DB maint
[09:39:46] <jynus>	 it should recover
[09:39:51] <yuvipanda>	 (it was failing when I went to sleep right after failover, so I assumed it's been dead since)
[09:40:01] <yuvipanda>	 chasemp: it'll catch up after the db is back tho
[09:40:21] <chasemp>	 even if it hits 5 out of 6 servers it will finish up that 6th one later?
[09:40:30] <chasemp>	 (honestly just unsure how it was written)
[09:40:58] <jynus>	 that should indeed be taken into account - I can assure the service is up, not the servers
[09:41:05] <jynus>	 maybe I can add a ticket
[09:41:09] <jynus>	 to investigate later
[09:41:16] <yuvipanda>	 chasemp: yes
[09:41:23] <yuvipanda>	 it keeps state per-host per-user
[09:41:30] <chasemp>	 ok, neat
[09:41:43] <yuvipanda>	 chasemp: I wrote it that way so we can easily add / remove servers later
[09:41:43] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[09:41:57] <chasemp>	 so then the thing to do is either disable new tool creation during maint or silence the check, but either is the same outcome
[09:42:36] <chasemp>	 yuvipanda: yeah makes sense
[09:42:38] <yuvipanda>	 yeah. I think we should just silence the check as part of doing maint on labsdbs
[09:42:48] <icinga-wm>	 ACKNOWLEDGEMENT - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed Yuvi Panda maint on labsdbs
[09:43:06] <jynus>	 ack won't work
[09:43:11] <jynus>	 it will fail 3 times
[09:46:38] <yuvipanda>	 right. I'll set downtime
[09:47:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: systemd::syslog: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/333167 (https://phabricator.wikimedia.org/T155768)
[09:51:09] <wikibugs>	 06Operations, 10Traffic, 07HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2955191 (10Krenair) payments.wikimedia.org is only a single hostname cert though right?  I haven't looked into the details of CAA but hopefully it would be possible to set up such an exception just fo...
[09:51:56] <wikibugs>	 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#2955546 (10MoritzMuehlenhoff) p:05Triage>03Normal
[09:52:43] <wikibugs>	 (03PS1) 10Jcrespo: Update control files for mariadb 10.0 and 10.1 packages [software] - 10https://gerrit.wikimedia.org/r/333221
[09:53:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: enable back shadow traffic to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/333222 (https://phabricator.wikimedia.org/T151851)
[09:54:26] <wikibugs>	 (03PS2) 10Jcrespo: Update control files for mariadb 10.0 and 10.1 packages [software] - 10https://gerrit.wikimedia.org/r/333221
[09:56:20] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2955552 (10Marostegui) >>! In T145885#2953741, @demon wrote: >>>! In T145885#2951958, @Marostegui wrote: >> That will convert ALL...
[09:57:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] swift: enable back shadow traffic to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/333222 (https://phabricator.wikimedia.org/T151851) (owner: 10Filippo Giunchedi)
[09:59:06] <wikibugs>	 (03CR) 10Marostegui: [C: 031] Update control files for mariadb 10.0 and 10.1 packages [software] - 10https://gerrit.wikimedia.org/r/333221 (owner: 10Jcrespo)
[10:02:14] <godog>	 !log reload swift-proxy on ms-fe1001 to pick up https://gerrit.wikimedia.org/r/333222
[10:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:02] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2955567 (10Marostegui) See: T145885#2951958  This change requires us to change 2 global flags on the host.  As well as doing the AL...
[10:15:37] <moritzm>	 !log installing exim bugfix updates from latest jessie point release
[10:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] systemd::syslog: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/333167 (https://phabricator.wikimedia.org/T155768) (owner: 10Giuseppe Lavagetto)
[10:19:25] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: systemd::syslog: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/333167 (https://phabricator.wikimedia.org/T155768)
[10:19:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] systemd::syslog: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/333167 (https://phabricator.wikimedia.org/T155768) (owner: 10Giuseppe Lavagetto)
[10:30:57] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2955640 (10Gilles) Issue when deployed, the X-Forwarded-For header in production is a list of IPs, which might explain why the throttle kicks in incorrectly f...
[10:31:56] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2955642 (10Gilles) The swift loader has a noisy error, I have to check if it's only legit 404s:  ``` Jan 20 10:31:37 thumbor1002 thumbor@8817[37138]:...
[10:34:29] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2955644 (10Gilles) Also, those 404s are not making it to the 404 log anymore. Possibly because the filtering for the 404 log was based on an error fro...
[10:37:11] <wikibugs>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2955645 (10Joe) Extracting from the session outcomes:  What we want to do: * Stop relying in co...
[10:38:31] <wikibugs>	 (03PS1) 10Yuvipanda: labs: Only include nfsclient if *any* nfs mounts are enabled [puppet] - 10https://gerrit.wikimedia.org/r/333227
[10:39:27] <wikibugs>	 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2955646 (10Joe) Problem is now fixed and not just for parsoid.
[10:39:38] <wikibugs>	 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2955647 (10Joe) 05Open>03Resolved a:03Joe
[10:39:43] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[10:39:43] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[10:39:47] <elukey>	 !log manually forcing a /etc/init.d/apache2 reload on mw1259 (videoscaler) to replicate the effects of a logrotate run and test why alarms go off.
[10:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:53] <_joe_>	 elukey: stop
[10:40:08] <_joe_>	 elukey: is it /etc/init.d or upstart?
[10:40:13] <_joe_>	 still on init.d on trusty?
[10:40:27] <_joe_>	 yeah, it is, yuck
[10:41:35] <elukey>	 I checked on logrotate :)
[10:42:07] <elukey>	 so the scoreboard remains the same, but the busyworkers field goes down to 1
[10:42:28] <elukey>	 HHVM check health shows steady load
[10:42:43] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:42:55] <wikibugs>	 (03PS1) 10Gilles: Remove broken Thumbor IP throttling from configuration [puppet] - 10https://gerrit.wikimedia.org/r/333228 (https://phabricator.wikimedia.org/T151066)
[10:44:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Remove broken Thumbor IP throttling from configuration [puppet] - 10https://gerrit.wikimedia.org/r/333228 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles)
[10:44:43] <elukey>	 the alarms was expected :)
[10:44:47] <elukey>	 *alarm
[10:48:52] <godog>	 !log reload swift-proxy on ms-fe100* to pick up https://gerrit.wikimedia.org/r/333222
[10:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:51] <wikibugs>	 (03PS1) 10Hashar: labstore: check should search for exact mount match [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820)
[10:55:55] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2955670 (10Gilles) Could be related, some metrics aren't being reported on Grafana anymore:  https://grafana.wikimedia.org/dashboard/db/thumbor
[10:57:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Update control files for mariadb 10.0 and 10.1 packages [software] - 10https://gerrit.wikimedia.org/r/333221 (owner: 10Jcrespo)
[11:00:57] <wikibugs>	 (03CR) 10Hashar: "Cherry picked it on the integration puppet master and that fixes it.  I have tested that it still matches a /home mount but better double check :}" [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar)
[11:01:30] <wikibugs>	 (03PS1) 10Faidon Liambotis: Setup & configure certspotter [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807)
[11:02:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup & configure certspotter [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807) (owner: 10Faidon Liambotis)
[11:02:30] <paravoid>	 grumble
[11:03:26] <wikibugs>	 (03PS2) 10Faidon Liambotis: Setup & configure certspotter [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807)
[11:07:04] <wikibugs>	 (03PS3) 10Faidon Liambotis: docker: cleanup the custom apt repository stanzas [puppet] - 10https://gerrit.wikimedia.org/r/327243
[11:08:32] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] docker: cleanup the custom apt repository stanzas [puppet] - 10https://gerrit.wikimedia.org/r/327243 (owner: 10Faidon Liambotis)
[11:24:51] <wikibugs>	 (03CR) 10Elukey: [C: 031] "Awesome! The sooner this gets merged the better, so we'll be able to run some tests." [puppet] - 10https://gerrit.wikimedia.org/r/332457 (https://phabricator.wikimedia.org/T155487) (owner: 10Muehlenhoff)
[11:33:38] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#2955722 (10fgiunchedi) thanks Chris! it looks like ms-fe1008 issue with the installer is an instance of {T149845} for which we don't have a root cause yet. I was able to fix it by manually...
[11:35:43] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[11:38:43] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[11:39:18] <wikibugs>	 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2955738 (10hashar) That is still happening. Happened today when creat...
[11:41:36] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling: Update npm to 3 or 4 - https://phabricator.wikimedia.org/T155488#2944663 (10hashar) p:05Triage>03Normal
[11:42:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728)
[11:43:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "If this is temporary, restrict to hadoop nodes, not arbitrary hosts." [puppet] - 10https://gerrit.wikimedia.org/r/332457 (https://phabricator.wikimedia.org/T155487) (owner: 10Muehlenhoff)
[11:45:43] <Amir1>	 Hey Ops. Since we don't have any SWAT window for today and (obviously) weekend. Can I deploy the patch for this UBN! bug? https://phabricator.wikimedia.org/T155500
[11:47:18] <_joe_>	 seems serious enough to grant a deploy
[11:47:25] <_joe_>	 but you need someone to review your code
[11:48:10] <Amir1>	 _joe_: the patch is reviewed and merged by someone else (a colleague from WMDE)
[11:48:33] <Amir1>	 Thanks! I start the deployment now
[11:48:36] <_joe_>	 oh I didn't see that
[11:48:50] <_joe_>	 do you need assistance in actually deploying this?
[11:49:05] <Amir1>	 I'll ask if I run into problems
[11:49:13] <Amir1>	 I deployed mediawiki before 
[11:49:28] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "Actually, thinking better, access to the DBs should never be granted except for administration and monitoring purposes, there is a proxy t" [puppet] - 10https://gerrit.wikimedia.org/r/332457 (https://phabricator.wikimedia.org/T155487) (owner: 10Muehlenhoff)
[11:49:49] <elukey>	 jynus: we usually use ANALYTICS_NETWORKS since afaik we don't have a list of hadoop nodes in puppet to use as ferm whitelist.. If you strongly need a stricter ferm rule I'll come up with a workaround, but for the moment it might be fine to just allow ANALYTICS_NETWORKS?
[11:49:59] <elukey>	 ah nice -2 now :P
[11:50:42] <jynus>	 you are opening a hole to the dbs, that only root should have access to
[11:50:51] <jynus>	 there are proxies in place 
[11:51:05] <jynus>	 to channel and control access
[11:52:11] <_joe_>	 so instead of -2, you might want to indicate where this rule should point to?
[11:52:23] <jynus>	 no need for rules
[11:52:29] <jynus>	 the proxies are open everywhere
[11:52:38] <elukey>	 very nice, didn't know it :)
[11:53:11] <jynus>	 if they cannot access the proxies, it is a vlan limitation
[11:53:17] <jynus>	 not iptables
[11:54:15] <elukey>	 jynus: so if I got it correctly, dbproxy1010.eqiad.wmnet and dbproxy1011.eqiad.wmnet should be enough for our use case?
[11:54:21] <jynus>	 no
[11:54:31] <jynus>	 you use labsdb-analytics.eqiad.wmnet
[11:54:45] <jynus>	 proxies will take care of redirecting things to the right place
[11:56:57] <elukey>	 ah ok it is a cname for dbproxy1010.eqiad.wmnet, nice
[11:57:21] <elukey>	 all right even better, thanks :)
[11:57:26] <elukey>	 Joseph will be happy
[11:57:47] <jynus>	 why didn't Joseph talk to me?
[11:59:25] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Grant temporary access to labsdb replica from Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/332457 (https://phabricator.wikimedia.org/T155487) (owner: 10Muehlenhoff)
[11:59:44] <elukey>	 He trusted his ops engineer (bad choice) and we asked Moritz.. We would have alerted you but I saw you in the code review so I waited for your opinion
[12:00:06] <jynus>	 I have 200 code reviews I am in
[12:00:15] <jynus>	 I only see them if I am pinged
[12:00:16] <wikibugs>	 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2955790 (10MoritzMuehlenhoff) p:05Triage>03Normal
[12:00:35] <elukey>	 it was an attempt to tell you that next time we'll ping you directly :)
[12:00:41] <jynus>	 I am not even involved on https://phabricator.wikimedia.org/T155658
[12:00:54] <jynus>	 labsdbs are a new, still undocumented project
[12:01:01] <jynus>	 you will need my help
[12:01:10] <jynus>	 still WIP
[12:01:24] <jynus>	 I am not saying it should not be used
[12:01:37] <jynus>	 I am saying that you will need support from me or Manuel
[12:02:30] <elukey>	 sure, I believe that at some point you guys would have been contacted. The idea was to test getting data from labs db with this patch to make sure that everything would have worked
[12:02:47] <elukey>	 but you are right and we should have pinged you even before starting
[12:03:15] <elukey>	 (I was convinced that you were aware after SF about this project but I was wrong)
[12:03:38] <jynus>	 no, Joel discussed me about importing revision data
[12:03:42] <jynus>	 *with
[12:03:49] <jynus>	 nothing about labsdb metadata
[12:03:58] <jynus>	 no more than "it will be useful"
[12:04:40] <jynus>	 I have 0 problems with it, it is just that if you ask for help I can make life much easier for you
[12:04:48] <jynus>	 like this proxy
[12:04:55] <elukey>	 yep yep thanks a lot
[12:05:25] <elukey>	 can Joseph run some tests over the weekend with the labsdb-analytics proxy or should he wait to talk with you first?
[12:05:38] <jynus>	 why wait on me
[12:05:48] <jynus>	 you should never get blocked on me
[12:05:56] <elukey>	 just asking to be sure, that's it :)
[12:06:11] <jynus>	 the only concern is that
[12:06:13] <elukey>	 all right Joseph is out today but I'll send an email with a recap
[12:06:25] <jynus>	 if the server goes down, it will take weeks to recover
[12:06:43] <jynus>	 because we cannot do gtids yet on labsdbs
[12:06:53] <icinga-wm>	 PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:07:16] <elukey>	 all right so use with extreme care
[12:13:10] <Amir1>	 !log deploy wmf.8 in mwdebug1002 (T155500)
[12:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:13] <stashbot>	 T155500: Fatal exception of type "DBQueryError" on sorting ORES contributions - https://phabricator.wikimedia.org/T155500
[12:16:53] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:18:33] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 2.033 second response time
[12:18:38] <wikibugs>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Expand conftool to support multiple objects via a schema definition. - https://phabricator.wikimedia.org/T155823#2955826 (10Joe)
[12:19:33] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.377 second response time
[12:25:16] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2955845 (10Esc3300) @Lydia_Pintscher  To check where this should go, could we do  T154017 ?
[12:27:55] <Amir1>	 After some going back and forth I confirm it's working
[12:28:05] <Amir1>	 deploying to all
[12:32:15] <logmsgbot>	 !log ladsgroup@tin Synchronized php-1.29.0-wmf.8/extensions/ORES/includes/Hooks.php: [[gerrit:333226|ORES database query fix]] (T155500) (duration: 00m 40s)
[12:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:18] <stashbot>	 T155500: Fatal exception of type "DBQueryError" on sorting ORES contributions - https://phabricator.wikimedia.org/T155500
[12:34:47] <Amir1>	 live everywhere and works as expected 
[12:35:53] <icinga-wm>	 RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[12:36:49] <elukey>	 nice! 
[12:42:42] <Amir1>	 Thanks elukey 
[12:45:53] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[12:56:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove gehel from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333240 (https://phabricator.wikimedia.org/T142836)
[13:00:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836)
[13:01:58] <wikibugs>	 (03CR) 10Elukey: [C: 031] "Bad Analytics ops are bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[13:14:33] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.858 second response time
[13:14:35] <wikibugs>	 06Operations: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2955932 (10MoritzMuehlenhoff) This can also be used as a data/synchronisation point whether someone is staff (i.e. if using a wikimedia.org address) or not (useful for cross-validating NDA status as well).
[13:15:06] <wikibugs>	 06Operations: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2955936 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[13:15:23] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.696 second response time
[13:16:22] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2955937 (10Paladox) I have never tested with a master and a replicate so not sure if it will work.  I've only tested it on the main...
[13:22:13] <icinga-wm>	 PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:24:15] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333245
[13:27:57] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333245 (owner: 10Marostegui)
[13:28:46] <wikibugs>	 06Operations: Package the next LTS kernel (likely 4.9) - https://phabricator.wikimedia.org/T154934#2955949 (10MoritzMuehlenhoff) Now confirmed as the next LTS kernel: http://lkml.iu.edu/hypermail/linux/kernel/1701.2/03438.html
[13:29:23] <wikibugs>	 06Operations: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#2955950 (10MoritzMuehlenhoff)
[13:29:36] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333245 (owner: 10Marostegui)
[13:29:47] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333245 (owner: 10Marostegui)
[13:30:36] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 - T153300 (duration: 00m 39s)
[13:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:40] <stashbot>	 T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300
[13:50:13] <icinga-wm>	 RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:54:23] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.412 second response time
[13:55:23] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.495 second response time
[13:57:36] <wikibugs>	 (03PS7) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455)
[13:57:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247
[13:59:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi)
[13:59:28] <godog>	 wha wha
[14:02:03] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[14:04:26] <elukey>	 godog: whaaaa?
[14:04:54] <godog>	 elukey: I got sad_trombone.wav going on when jenkins -1s me
[14:06:28] <wikibugs>	 (03PS8) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455)
[14:10:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Needed for related https://gerrit.wikimedia.org/r/#/c/310549/" [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi)
[14:15:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 4.4.44 [debs/linux44] - 10https://gerrit.wikimedia.org/r/333250
[14:26:50] <godog>	 gah, gerrit patchsets downloaded as .tgz create files in '.'
[14:29:31] <wikibugs>	 (03PS3) 10Jcrespo: m1,m2,m3,m4,m5.hosts: Add new host files [software] - 10https://gerrit.wikimedia.org/r/332747 (owner: 10Marostegui)
[14:29:53] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[14:30:58] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] m1,m2,m3,m4,m5.hosts: Add new host files [software] - 10https://gerrit.wikimedia.org/r/332747 (owner: 10Marostegui)
[14:32:39] <wikibugs>	 (03CR) 10Marostegui: [C: 032] "Thanks" [software] - 10https://gerrit.wikimedia.org/r/332747 (owner: 10Marostegui)
[14:37:36] <wikibugs>	 (03Merged) 10jenkins-bot: m1,m2,m3,m4,m5.hosts: Add new host files [software] - 10https://gerrit.wikimedia.org/r/332747 (owner: 10Marostegui)
[14:41:56] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2956039 (10Dzahn) >>! In T155764#2955567, @Marostegui wrote: > This is m2 and the other databases on this host are: > bugzilla_test...
[14:42:07] <jynus>	 !log restart and upgrade of db2067
[14:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5174/" [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi)
[14:49:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[14:49:33] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]
[14:51:09] <wikibugs>	 06Operations, 07Puppet, 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2956045 (10Joe)
[14:51:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a first batch of email addresses to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/333259
[14:51:56] <wikibugs>	 (03PS3) 10Andrew Bogott: graphite: Don't use wikitech API to find labs projects/instances [puppet] - 10https://gerrit.wikimedia.org/r/328608 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[14:55:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] graphite: Don't use wikitech API to find labs projects/instances [puppet] - 10https://gerrit.wikimedia.org/r/328608 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[14:58:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[14:58:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[15:02:13] <icinga-wm>	 PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 seconds ago with 3 failures. Failed resources (up to 3 shown): Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-novaclient]
[15:20:55] <wikibugs>	 (03PS1) 10Andrew Bogott: Apt:  Add a proxy to grab openstack mitaka packages from Mirantis [puppet] - 10https://gerrit.wikimedia.org/r/333263
[15:28:48] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "graphite: Don't use wikitech API to find labs projects/instances" [puppet] - 10https://gerrit.wikimedia.org/r/333264
[15:30:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Revert "graphite: Don't use wikitech API to find labs projects/instances" [puppet] - 10https://gerrit.wikimedia.org/r/333264 (owner: 10Andrew Bogott)
[15:32:13] <icinga-wm>	 RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[15:36:50] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355
[15:37:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Nice work! I like especially the query language and puppetdb querying. Some random comments inline." (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[15:39:28] <wikibugs>	 (03PS2) 10Andrew Bogott: Apt:  Add a proxy to grab openstack mitaka packages from Mirantis [puppet] - 10https://gerrit.wikimedia.org/r/333263
[15:39:30] <wikibugs>	 (03PS1) 10Andrew Bogott: graphite: Don't use wikitech API to find labs projects/instances [puppet] - 10https://gerrit.wikimedia.org/r/333267 (https://phabricator.wikimedia.org/T104575)
[15:42:07] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2956092 (10Paladox) @demon would you think what @Marostegui said would work. (master to replication)?
[15:46:02] <wikibugs>	 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (10MoritzMuehlenhoff) But reprepro somewhat supports multiple versions as long as they're stored in different sections (or whatever the exact termin...
[15:47:03] <wikibugs>	 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956102 (10yuvipanda) We'll have to create maybe a 'labs' section in reprepo and use it?
[15:48:29] <wikibugs>	 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2956105 (10MoritzMuehlenhoff) That or maybe "staging" to make it a little more generic.
[15:49:28] <wikibugs>	 (03CR) 10Krinkle: Configure RCFeeds to use EventBus extension to send recentchange events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata)
[16:01:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] "Moritz- and yuvi-approved" [puppet] - 10https://gerrit.wikimedia.org/r/333263 (owner: 10Andrew Bogott)
[16:03:20] <jynus>	 !log restart and upgrade of db2066
[16:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:01] <wikibugs>	 (03CR) 10BryanDavis: [C: 031] "We should remember to update https://wikitech.wikimedia.org/wiki/Scholarships.wikimedia.org#Where_are_the_logs.3F after the change is merg" [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[16:11:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Cherry-picked in tools, works correctly" [puppet] - 10https://gerrit.wikimedia.org/r/332355 (owner: 10Giuseppe Lavagetto)
[16:18:16] <cmjohnson1>	 !log swapping cable eth0 labstore1004 (chasemp)
[16:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:58] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.21 seconds
[16:24:11] <wikibugs>	 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services (watching): Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2079854 (10Eevans) Weird:  ``` $ sudo tcpdump -ni eth0 src host restbase1010-a.eqiad.wmnet and proto TCP and src por...
[16:24:15] <godog>	 uh oh?
[16:24:27] <marostegui>	 checking db1045
[16:24:31] <jynus>	 m
[16:24:43] <marostegui>	 you?
[16:26:00] <jynus>	 someone is altering a table there
[16:26:04] <marostegui>	 interesting, the alter table is causing a replication issue for the first time
[16:26:13] <marostegui>	 it is me, but it has never caused lag as it is online
[16:26:20] <marostegui>	 I will kill it
[16:26:26] <jynus>	 wait
[16:26:50] <jynus>	 I assume that is a dump slave
[16:27:06] <jynus>	 metadata locking kicked in
[16:27:38] <wikibugs>	 (03CR) 10BryanDavis: "> a warning next to it about "No cache mapping for this field"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331808 (owner: 10Aaron Schulz)
[16:27:49] <jynus>	 did you run that manually?
[16:28:29] <marostegui>	 yep
[16:28:34] <apergos>	 yeah it's vslow dumps indeed
[16:28:36] <jynus>	 without the wrapper?
[16:28:38] <marostegui>	 yeah, it is one of them
[16:28:48] <marostegui>	 jynus: yeah, no wrapper
[16:28:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] graphite: Don't use wikitech API to find labs projects/instances [puppet] - 10https://gerrit.wikimedia.org/r/333267 (https://phabricator.wikimedia.org/T104575) (owner: 10Andrew Bogott)
[16:28:53] <jynus>	 because the whole idea of the wrapper is to prevent metadata locks
[16:29:16] <jynus>	 especially on dump-vslow hosts
[16:29:27] <jynus>	 you can kill it, it has not even started
[16:29:35] <marostegui>	 this is the first slave that has this problem across all the ones I have altered already
[16:29:42] <marostegui>	 could be maybe the first vslow one
[16:29:52] <jynus>	 it has been locked for 8 hours
[16:30:01] <marostegui>	 killed
[16:30:06] <apergos>	 well that is probably around when the dump run started
[16:30:10] <apergos>	 2nd run of the month
[16:30:14] <jynus>	 "Waiting for table metadata lock"
[16:30:31] <jynus>	 the wrapper lowers the innodb and sql timeouts to prevent those issues
[16:30:48] <marostegui>	 alter is gone
[16:31:36] <marostegui>	 it is catching up now
[16:32:11] <jynus>	 so vslow and dump ones are likely to create metadata locks
[16:32:35] <jynus>	 because of several-hour selects
[16:32:52] <jynus>	 either run it with the wrapper or depool, wait some hours
[16:33:09] <jynus>	 worst thing that happened here is the page
[16:33:35] <jynus>	 no production impact, although we may have slowed down the dump process a few hours
[16:33:44] <marostegui>	 yeah
[16:33:52] <apergos>	 we'll survive
[16:34:01] <apergos>	 and it would only be slow on that one shard
[16:34:03] <marostegui>	 I think this might be the first vslow one that goes thru compress in the batch I did this week
[16:35:45] <wikibugs>	 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services (watching): Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2956203 (10Eevans) >>! In T128590#2956184, @Eevans wrote:  [ ... ]  > ``` > $ sudo tcpdump -ni eth0 src host restbas...
[16:36:52] <wikibugs>	 (03PS1) 10Dzahn: dumps: remove nginx for download.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/333272 (https://phabricator.wikimedia.org/T107575)
[16:36:58] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Replication lag: 4.26 seconds
[16:36:59] <marostegui>	 lag is gone
[16:39:46] <wikibugs>	 (03CR) 10Dzahn: "This is not really guaranteed to be true: "Using a wikimedia.org address denotes staff status (either employee or contractor), useful for " [puppet] - 10https://gerrit.wikimedia.org/r/333259 (owner: 10Muehlenhoff)
[16:43:22] <wikibugs>	 (03CR) 10Dzahn: "download.wikimedia.org has address 208.80.154.224" [puppet] - 10https://gerrit.wikimedia.org/r/333272 (https://phabricator.wikimedia.org/T107575) (owner: 10Dzahn)
[16:47:05] <mutante>	 apergos: ^ removing nginx config for download.wm.org since .. it's on the cluster nowadays
[16:47:12] <mutante>	 well, up for review i mean
[16:48:29] <apergos>	 ok
[16:57:28] <wikibugs>	 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2956240 (10RobH) 05stalled>03Resolved
[16:58:13] <wikibugs>	 06Operations, 10hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#2956241 (10RobH) 05stalled>03Resolved Orders have been placed, sub-tasks follow implementation.
[17:07:35] <wikibugs>	 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2956263 (10RobH) 05Open>03Resolved This task was actually filled months ago, and I neglected to clean up and resolve this (actual service implementation...
[17:15:06] <wikibugs>	 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2956267 (10demon) >>! In T155764#2955567, @Marostegui wrote: > Given the size of the tables (very small ones) I would rather do the...
[17:17:57] <andrewbogott>	 !log graceful'd apache on silver, in hopes that the wikitech instance api will update
[17:18:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:14] <chasemp>	 !log shutdown eth1 on labstore1004 for testing
[17:22:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:20] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "i checked all the email addresses in that file with "exim4 -bt" and they are all deliverable Google addresses." [puppet] - 10https://gerrit.wikimedia.org/r/333259 (owner: 10Muehlenhoff)
[17:26:05] <wikibugs>	 (03PS2) 10Dzahn: Add a first batch of email addresses to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/333259 (owner: 10Muehlenhoff)
[17:27:22] <wikibugs>	 (03PS3) 10Dzahn: Add a first batch of email addresses to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/333259 (https://phabricator.wikimedia.org/T142826) (owner: 10Muehlenhoff)
[17:28:59] <wikibugs>	 06Operations, 13Patch-For-Review: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2547423 (10Dzahn) >>! In T142826#2955932, @MoritzMuehlenhoff wrote: > This can also be used as a data/synchronisation point whether someone is staff (i.e. if using a wikimedia.org address) or not (useful...
[17:31:10] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Add a first batch of email addresses to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/333259 (https://phabricator.wikimedia.org/T142826) (owner: 10Muehlenhoff)
[17:31:13] <wikibugs>	 06Operations, 13Patch-For-Review: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2956308 (10MoritzMuehlenhoff) >>! In T142826#2956304, @Dzahn wrote: >>>! In T142826#2955932, @MoritzMuehlenhoff wrote: >> This can also be used as a data/synchronisation point whether someone is staff (i...
[17:32:19] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2956310 (10MisterSynergy) >>! In T153563#2949967, @Smalyshev wrote: > each object has its own globally unique identifier, which is the ful...
[17:38:12] <wikibugs>	 06Operations, 13Patch-For-Review: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2956331 (10Krenair) @wikimedia.org isn't just staff mail, OTRS and probably other systems use that.
[17:39:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.44 [debs/linux44] - 10https://gerrit.wikimedia.org/r/333250 (owner: 10Muehlenhoff)
[17:40:42] <wikibugs>	 (03CR) 10ArielGlenn: [C: 031] "yep, download.wm.o resolves to text-lb.eqiad so this should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/333272 (https://phabricator.wikimedia.org/T107575) (owner: 10Dzahn)
[17:41:06] <wikibugs>	 06Operations, 13Patch-For-Review: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2956338 (10MoritzMuehlenhoff) >>! In T142826#2956331, @Krenair wrote: > @wikimedia.org isn't just staff mail, OTRS and probably other systems use that.  That doesn't matter: Email addresses added here ar...
[17:44:16] <wikibugs>	 (03PS2) 10Dzahn: dumps: remove nginx for download.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/333272 (https://phabricator.wikimedia.org/T107575)
[17:45:51] <wikibugs>	 (03CR) 10Dzahn: [C: 032] dumps: remove nginx for download.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/333272 (https://phabricator.wikimedia.org/T107575) (owner: 10Dzahn)
[17:47:51] <wikibugs>	 (03CR) 10Dzahn: [C: 032] kartotherian: optional parameter listed before required [puppet] - 10https://gerrit.wikimedia.org/r/332956 (owner: 10Dzahn)
[17:59:37] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2956367 (10Cmjohnson)
[18:06:52] <wikibugs>	 (03PS2) 10Yuvipanda: labstore: check should search for exact mount match [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar)
[18:13:38] <wikibugs>	 (03PS1) 10Cmjohnson: Adding dhcpd entries for aqs1007-9 T155654 [puppet] - 10https://gerrit.wikimedia.org/r/333281
[18:26:05] <wikibugs>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2956444 (10matmarex) >>! In T155769#2955296, @Marostegui wrote: > Just to try to understand the whole picture, what is the impact of having that empty ro...
[18:26:33] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for aqs1007-9 T155654 [puppet] - 10https://gerrit.wikimedia.org/r/333281 (owner: 10Cmjohnson)
[18:35:53] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:36:03] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[18:37:48] <wikibugs>	 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2736876 (10scfc) (T152941 is slightly related, but refers to the case...
[18:39:53] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[18:40:03] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[18:40:41] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2956480 (10Smalyshev) @MisterSynergy using two ids means each time you want to query something related to this object, you need to do 2 qu...
[18:40:59] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2956481 (10Smalyshev) p:05Triage>03Low
[18:43:52] <wikibugs>	 (03CR) 10Madhuvishy: [C: 031] "looks good, let's look at merging next week once things have calmed down I think" [puppet] - 10https://gerrit.wikimedia.org/r/333227 (owner: 10Yuvipanda)
[18:47:59] <wikibugs>	 06Operations, 10Traffic: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2956496 (10ema)
[18:49:10] <wikibugs>	 (03CR) 10Tim Landscheidt: labstore: check should search for exact mount match (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar)
[18:55:22] <wikibugs>	 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2956511 (10Deskana) 05Open>03declined This seems to be a "it would be nice to investigate and sort this out", which doesn'...
[19:01:42] <wikibugs>	 (03CR) 10Volans: "@godog thanks for the review!" (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[19:10:48] <wikibugs>	 (03PS4) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643)
[19:12:02] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema)
[19:14:33] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:15:33] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[19:17:13] <icinga-wm>	 PROBLEM - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:18:14] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2956567 (10Cmjohnson)
[19:18:17] <wikibugs>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2953992 (10Legoktm) Let's just delete it? Seems similar to T96233.
[19:28:08] <wikibugs>	 (03PS1) 10RobH: staging patch for ulsfo onsite work [dns] - 10https://gerrit.wikimedia.org/r/333288
[19:28:31] <wikibugs>	 (03CR) 10RobH: [C: 04-1] "no one should merge this, ill only use it if i mess up onsite work and have to depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/333288 (owner: 10RobH)
[19:28:33] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.286 second response time
[19:29:33] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.007 second response time
[19:29:52] <robh>	 !log cp4012 donating its redundant power supply to lvs4002 with redundant supplies
[19:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:10] <wikibugs>	 (03PS1) 10Alex Monk: Follow-up I94eb86ba: Ignore projects where we can't list instances [puppet] - 10https://gerrit.wikimedia.org/r/333289 (https://phabricator.wikimedia.org/T104575)
[19:33:14] <wikibugs>	 06Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2956591 (10RobH)
[19:33:17] <wikibugs>	 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2956589 (10RobH) 05Open>03Resolved Ok, it seems better to have two cp systems with a single PSU each than lose one entirely.  I double checked and @bblack agreed with that.  So now cp4012 has one...
[19:33:46] <wikibugs>	 06Operations, 10ops-codfw, 10ops-ulsfo: cp4008 power supply failure - https://phabricator.wikimedia.org/T151275#2956592 (10RobH) Also note that I had to take a PSU from cp4012 so it is also down to a single PSU
[19:35:02] <wikibugs>	 06Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2812572 (10RobH)
[19:35:04] <wikibugs>	 06Operations, 10ops-codfw, 10ops-ulsfo: cp4008 and cp4012 running on single PSU - https://phabricator.wikimedia.org/T151275#2956595 (10RobH) 05Open>03stalled p:05Normal>03Low
[19:35:46] <wikibugs>	 (03PS2) 10Alex Monk: Follow-up I94eb86ba: Ignore projects where we can't list instances [puppet] - 10https://gerrit.wikimedia.org/r/333289 (https://phabricator.wikimedia.org/T104575)
[19:36:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290
[19:37:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Follow-up I94eb86ba: Ignore projects where we can't list instances [puppet] - 10https://gerrit.wikimedia.org/r/333289 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[19:38:19] <wikibugs>	 (03CR) 10Alex Monk: [V: 032 C: 032] Update novaobserver passwd [labs/private] - 10https://gerrit.wikimedia.org/r/333290 (owner: 10Andrew Bogott)
[19:39:08] <wikibugs>	 06Operations, 10ops-ulsfo: atlas-ulsfo missing asset tag info in racktables - https://phabricator.wikimedia.org/T145141#2956605 (10RobH) 05Open>03Resolved WMF5802 tag added
[19:39:40] <wikibugs>	 06Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2956609 (10RobH) 05Open>03Resolved all fallout of this has been documented or resolved.
[19:40:29] <wikibugs>	 (03Abandoned) 10RobH: staging patch for ulsfo onsite work [dns] - 10https://gerrit.wikimedia.org/r/333288 (owner: 10RobH)
[19:43:51] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 032 C: 032] Add fake clushuser keypair [labs/private] - 10https://gerrit.wikimedia.org/r/325050 (owner: 10Merlijn van Deen)
[19:44:08] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 032 C: 032] Add tools hiera common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/325041 (owner: 10Merlijn van Deen)
[19:45:13] <icinga-wm>	 RECOVERY - puppet last run on install1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:45:35] <robh>	 !log messing with ulsfo serial connections
[19:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:22] <robh>	 paravoid: so its not the cable or the scs for cr1-ulsfo
[19:46:31] <robh>	 i just moved working serial for mr1-ulsfo to cr1-ulsfo
[19:46:33] <robh>	 and no output
[19:46:44] <robh>	 would have been nice if it was just the cable.
[19:47:39] <robh>	 oh wait
[19:48:09] <robh>	 now both cables work
[19:48:17] <robh>	 after reseating a second time... wtf oh well
[19:50:03] <icinga-wm>	 PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:51:43] <wikibugs>	 (03PS3) 10Andrew Bogott: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[19:54:39] <wikibugs>	 (03CR) 10Volans: "It's nice to be able to run all locally. I've tried this locally checking out this change." [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar)
[19:55:02] <wikibugs>	 06Operations, 10ops-ulsfo: cr1-ulsfo broken serial cable (or port) - https://phabricator.wikimedia.org/T147430#2956620 (10RobH) 05Open>03Resolved so the cable worked in another system, but not in cr1-ulsfo.  then it worked in cr1-ulsfo, but its initial routing had the cable folded over a bit too tightly....
[19:55:25] <robh>	 !log done fixing ulsfo serial in ulsfo
[19:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:03] <icinga-wm>	 PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:06:45] <wikibugs>	 (03PS1) 10Urbanecm: Add *.finds.org.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333294 (https://phabricator.wikimedia.org/T155844)
[20:19:03] <icinga-wm>	 RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[20:34:03] <icinga-wm>	 RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:46:20] <wikibugs>	 (03PS3) 10Ottomata: Configure RCFeeds to use EventBus extension in beta to send recentchange events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030)
[20:47:49] <wikibugs>	 (03PS4) 10Ottomata: Configure RCFeeds to use EventBus extension in beta to send recentchange events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030)
[20:49:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Configure RCFeeds to use EventBus extension in beta to send recentchange events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata)
[20:51:03] <wikibugs>	 (03CR) 10Eevans: [C: 032] Prometheus JMX exporter deploy repository [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/332542 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans)
[20:53:26] <wikibugs>	 (03CR) 10Eevans: [V: 032 C: 032] Prometheus JMX exporter deploy repository [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/332542 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans)
[20:54:33] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.275 second response time
[20:55:33] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.478 second response time
[20:56:19] <wikibugs>	 (03PS1) 10Volans: varnishstatsd: temporary fix to avoid crashes [puppet] - 10https://gerrit.wikimedia.org/r/333296 (https://phabricator.wikimedia.org/T151643)
[20:59:31] <wikibugs>	 (03CR) 10Volans: [C: 032] "@ema merging this to unbreak production. It needs a better fix." [puppet] - 10https://gerrit.wikimedia.org/r/333296 (https://phabricator.wikimedia.org/T151643) (owner: 10Volans)
[20:59:39] <wikibugs>	 (03CR) 10Krinkle: Configure RCFeeds to use EventBus extension in beta to send recentchange events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata)
[21:00:01] <wikibugs>	 (03PS5) 10Ottomata: Configure RCFeeds to use EventBus extension in beta to send recentchange events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030)
[21:10:43] <wikibugs>	 (03CR) 10Eevans: [V: 032 C: 032] Prometheus JMX exporter deploy repository [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/332542 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans)
[21:18:55] <wikibugs>	 (03PS1) 10Thcipriani: Invalidate git index cache before smudging [debs/git-fat] - 10https://gerrit.wikimedia.org/r/333300 (https://phabricator.wikimedia.org/T147856)
[21:24:33] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:29:21] <wikibugs>	 (03CR) 10Chad: [C: 032] Invalidate git index cache before smudging [debs/git-fat] - 10https://gerrit.wikimedia.org/r/333300 (https://phabricator.wikimedia.org/T147856) (owner: 10Thcipriani)
[21:40:01] <wikibugs>	 06Operations, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#2956963 (10demon)
[21:52:33] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[22:13:13] <wikibugs>	 06Operations, 03Scap3: Package + deploy new version of git-fat - https://phabricator.wikimedia.org/T155856#2957067 (10thcipriani) `0.1.2-1` sounds reasonable to me. Changelog is reasonably thousandth-level-y  Changelog --- * Add whitespace for readability * Some code pep8-ifying * Ensured that a file touch pre...
[22:32:49] <Zppix>	 jouncebot now
[22:32:50] <jouncebot>	 No deployments scheduled for the next 63 hour(s) and 27 minute(s)
[22:44:13] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:12:13] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[23:21:45] <wikibugs>	 (03CR) 10Mattflaschen: [C: 032] Improve dblist name coherence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson)
[23:23:57] <wikibugs>	 (03Merged) 10jenkins-bot: Improve dblist name coherence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson)
[23:24:12] <wikibugs>	 (03CR) 10jenkins-bot: Improve dblist name coherence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson)
[23:29:33] <icinga-wm>	 PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:39] <ostriches>	 matt_flaschen: Are you planning to deploy ^?
[23:33:13] <matt_flaschen>	 ostriches, yes, sorry.
[23:33:17] <matt_flaschen>	 Will do it right now.
[23:33:18] <ostriches>	 No worries, just checking :)
[23:36:23] <logmsgbot>	 !log mattflaschen@tin Synchronized dblists: No-op file rename (duration: 00m 54s)
[23:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:18] <logmsgbot>	 !log mattflaschen@tin Synchronized docroot: No-op file rename (duration: 00m 46s)
[23:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:43] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[23:45:33] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[23:45:43] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3062162 keys, up 81 days 15 hours - replication_delay is 0
[23:46:23] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3062092 keys, up 81 days 15 hours - replication_delay is 0
[23:57:33] <icinga-wm>	 RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures