[00:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:00:24] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:01:13] <wikibugs>	 (PS11) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:04:08] <wikibugs>	 (PS12) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:07:10] <wikibugs>	 (PS13) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:08:14] <wikibugs>	 (CR) Dzahn: [C: 2] "finally http://puppet-compiler.wmflabs.org/9678/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/397729 (owner: Dzahn)
[00:09:24] <wikibugs>	 Operations, DBA, Performance-Team, Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3888608 (aaron) I fixed a stupid hostname var bug. Now I get numbers that make sense: ``` Same-DC (db2070.codfw.wmnet): stri...
[00:09:59] <wikibugs>	 (CR) Dzahn: "also fixes 3 x Parameter 'languages' of class 'profile.. ' has no call to hiera" [puppet] - https://gerrit.wikimedia.org/r/397729 (owner: Dzahn)
[00:17:28] <mutante>	 a reboot of phabricator server is imminent
[00:18:13] <wikibugs>	 (PS1) Alex Monk: letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326
[00:18:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326 (owner: Alex Monk)
[00:18:54] <mutante>	 !log rebooting phabricator server for kernel upgrade
[00:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:42] <wikibugs>	 (PS2) Alex Monk: letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326
[00:20:46] <Krenair>	 is it me, or is git-review broken?
[00:21:15] <mutante>	 Krenair: i just used it 
[00:21:35] <Krenair>	 must be my setup then
[00:22:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[00:23:08] <mutante>	 Krenair: 1.25.0-2
[00:23:33] <Krenair>	 actually I'm having trouble pulling from origin too
[00:23:49] <mutante>	 i touched phab but not gerrit 
[00:23:51] <mutante>	 yet
[00:23:58] <Krenair>	 though pushing that commit was fine as I just pushed to refs/for/production directly without bothering with git-review
[00:24:04] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy
[00:24:21] <mutante>	 it's a good time because i also have to reboot gerrit , heh
[00:24:29] <Krenair>	 heh
[00:24:40] <Krenair>	 more likely the problem is on my end
[00:24:41] <mutante>	 i mean.. better now than thinking it was related
[00:25:03] <Krenair>	 yeah
[00:26:30] <mutante>	 Krenair: there was something recently that had to be fixed about it
[00:26:37] <mutante>	 and then they did.. and it worked again
[00:26:47] <Krenair>	 fixed about what exactly?
[00:26:54] <mutante>	 maybe related to the Gerrit URL with or without /r/p  vs.  just /p
[00:26:56] <mutante>	 ehm..
[00:27:18] <Krenair>	 attempting to pull from https://gerrit.wikimedia.org/r/p/operations/puppet rather than my configured (ssh) origin is also just sitting there looking at me
[00:27:37] <Krenair>	 https://gerrit.wikimedia.org/p/operations/puppet is not found
[00:29:48] <mutante>	 Krenair: it exists with /r/p/
[00:30:08] <Krenair>	 yeah except my client just does nothing
[00:30:13] <mutante>	 i just have it configured with ssh
[00:30:31] <mutante>	 paladox knows this :)
[00:30:54] <mutante>	 afair
[00:31:14] <Krenair>	 meh
[00:31:18] <Krenair>	 I'll look at it some other day
[00:31:55] <Krenair>	 https://gerrit.wikimedia.org/r/#/c/403326/ is gonna need some legal review or something
[00:32:20] <Krenair>	 though the changes don't look very big, I don't know how to arrange it. I assume the reviewers do
[00:32:25] <mutante>	 i'll have the answer tomorrow, heh
[00:32:54] <mutante>	 you got the right reviewers, yes
[00:34:50] <paladox>	 mutante heh
[00:34:56] <paladox>	 Krenair update git-review :)
[00:35:04] <paladox>	 it includes a fix for this.
[00:35:11] <mutante>	 paladox: ^ thanks, i remembered the issue was there
[00:35:13] <Krenair>	 it's not just git-review having problems
[00:35:17] <Krenair>	 it appears to be my git client
[00:35:37] <mutante>	 paladox: https in git config?
[00:35:45] <mutante>	 with the /r/ and r/p  thing
[00:36:18] <mutante>	 i think my git-review is old enough to be before the issue 
[00:36:27] <mutante>	 but the latest has it fixed again
[00:36:43] <mutante>	 i installed from distro, not pip
[00:37:23] <paladox>	 Maybe another bug? as the one that was fixed was /changes/ but the actual fix is https://review.openstack.org/#/c/478325/
[00:37:54] <Krenair>	 oh there we go
[00:38:00] <Krenair>	 it took a while but git pull eventually worked
[00:38:01] <mutante>	 though he says just git client by itself too
[00:38:04] <mutante>	 ah
[00:38:21] <paladox>	 heh
[00:38:28] <no_justification>	 Old git is sad git
[00:38:31] <mutante>	 jouncebot: next
[00:38:31] <jouncebot>	 In 13 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400)
[00:43:45] <mutante>	 !log rebooting gerrit server for kernel upgrade
[00:43:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.12, 22.51, 23.88
[00:46:48] <mutante>	 gerrit back
[00:51:24] <icinga-wm>	 PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[00:54:33] <mutante>	 ^ side-effect of gerrit reboot, just a sec
[00:56:24] <icinga-wm>	 RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:23:10] <wikibugs>	 (CR) Gergő Tisza: "The line looks good. Not sure where I should check (or even if I have access), https://wikitech.wikimedia.org/wiki/Cron_jobs is not very i" [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[01:32:49] <wikibugs>	 (PS2) Dzahn: mariadb::tendril: move firewall to role, use profile [puppet] - https://gerrit.wikimedia.org/r/397725
[01:34:11] <wikibugs>	 (Abandoned) Dzahn: mariadb::tendril: move firewall to role, use profile [puppet] - https://gerrit.wikimedia.org/r/397725 (owner: Dzahn)
[01:36:00] <wikibugs>	 (PS2) Dzahn: druid: move firewall includes from site to roles [puppet] - https://gerrit.wikimedia.org/r/397726
[01:36:34] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.42, 34.88, 32.10
[01:38:34] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.40, 33.97, 32.09
[01:39:54] <mutante>	 !log mw1226 - high load - hhvm-dump-debug > /root/hhvm-dump-debug-20170109-1739PST.log ; restart-hhvm
[01:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:00] <wikibugs>	 (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/9679/druid1002.eqiad.wmnet/change.druid1002.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/397726 (owner: Dzahn)
[01:42:09] <wikibugs>	 (CR) Dzahn: [C: -1] "why is this even related?  Error: Could not find resource 'Exec[apt-get update]' for relationship from 'Class[Profile::Cdh::Apt]'" [puppet] - https://gerrit.wikimedia.org/r/397726 (owner: Dzahn)
[01:47:34] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.70, 15.67, 23.87
[02:10:34] <icinga-wm>	 PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[02:11:34] <icinga-wm>	 RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7815375 keys, up 5 minutes 20 seconds - replication_delay is 0
[02:23:24] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 49.04, 26.51, 20.89
[02:24:27] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 06m 02s)
[02:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1201 is OK: OK - load average: 18.06, 24.89, 21.90
[03:26:24] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 811.39 seconds
[03:38:23] <wikibugs>	 (PS1) KartikMistry: apertium-cat: New upstream and updated dependency on cg3 [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/403339 (https://phabricator.wikimedia.org/T171406)
[03:38:46] <wikibugs>	 (Abandoned) KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/397223 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:39:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-cat: New upstream and updated dependency on cg3 [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/403339 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:41:08] <wikibugs>	 (Abandoned) KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/397224 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:51:59] <wikibugs>	 (PS1) KartikMistry: apertium-cat-srd: New upstream and updated dependencies [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/403340 (https://phabricator.wikimedia.org/T171406)
[03:53:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-cat-srd: New upstream and updated dependencies [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/403340 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[04:04:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.72 seconds
[05:01:40] <wikibugs>	 (Draft2) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342
[05:02:10] <wikibugs>	 (PS3) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579)
[05:05:05] <wikibugs>	 (PS4) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579)
[05:12:23] <wikibugs>	 (CR) Jayprakash12345: "@SWAT, You can merge the task. Because we cant test it on mwdebug. So go ahead even if I am not around on Wikimedia-operations." [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: Jayprakash12345)
[05:23:47] <wikibugs>	 (PS1) KartikMistry: apertium-srd-ita: Updated cg3 dependency [debs/contenttranslation/apertium-srd-ita] - https://gerrit.wikimedia.org/r/403344 (https://phabricator.wikimedia.org/T171406)
[05:24:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-srd-ita: Updated cg3 dependency [debs/contenttranslation/apertium-srd-ita] - https://gerrit.wikimedia.org/r/403344 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:28:55] <wikibugs>	 (PS1) KartikMistry: apertium-swe: Updated dependency on cg3 [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/403345 (https://phabricator.wikimedia.org/T171406)
[05:29:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-swe: Updated dependency on cg3 [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/403345 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:32:49] <wikibugs>	 (PS1) KartikMistry: apertium-swe-dan: updated dependency on cg3 [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/403346 (https://phabricator.wikimedia.org/T171406)
[05:33:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-swe-dan: updated dependency on cg3 [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/403346 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:35:23] <wikibugs>	 (PS1) KartikMistry: apertium-swe-nor: Updated dependency on cg3 [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/403347 (https://phabricator.wikimedia.org/T171406)
[05:35:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-swe-nor: Updated dependency on cg3 [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/403347 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:41:48] <wikibugs>	 Operations, Discourse, Developer-Relations (Jan-Mar-2018), cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3888890 (Tgr)
[05:42:47] <wikibugs>	 Operations, Developer-Relations, Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3888892 (Tgr)
[05:43:05] <wikibugs>	 Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3888893 (Tgr)
[05:58:06] <wikibugs>	 (PS1) Urbanecm: Update officewiki logo, add HD logo for officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403349 (https://phabricator.wikimedia.org/T184575)
[05:58:16] <wikibugs>	 (PS3) Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[06:01:45] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: Jayprakash12345)
[06:02:37] <wikibugs>	 (CR) Urbanecm: "> In dblists/all.dblist, inhwiki should come before internalwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: Urbanecm)
[06:03:55] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM, technically." [mediawiki-config] - https://gerrit.wikimedia.org/r/395775 (https://phabricator.wikimedia.org/T182201) (owner: MarcoAurelio)
[06:04:27] <wikibugs>	 Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3888925 (Tgr) >>! In T180854#3882018, @Qgil wrote: > If replying via email is a wanted feature, then it should be discussed in a separate task blocking {T180853}. I will...
[06:04:45] <wikibugs>	 (PS2) Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [mediawiki-config] - https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[06:06:23] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM, technically. Not sure about the EDP and everything else needed for enabling." [mediawiki-config] - https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: TerraCodes)
[06:07:36] <wikibugs>	 (CR) Urbanecm: [C: 1] "Technically ok" [mediawiki-config] - https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: MarcoAurelio)
[06:13:20] <wikibugs>	 (PS1) KartikMistry: apertium-tat: Updated dependency on cg3 [debs/contenttranslation/apertium-tat] - https://gerrit.wikimedia.org/r/403350 (https://phabricator.wikimedia.org/T171406)
[06:13:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-tat: Updated dependency on cg3 [debs/contenttranslation/apertium-tat] - https://gerrit.wikimedia.org/r/403350 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:15:49] <wikibugs>	 (PS1) KartikMistry: apertium-tur: Updated dependency on cg3 [debs/contenttranslation/apertium-tur] - https://gerrit.wikimedia.org/r/403351 (https://phabricator.wikimedia.org/T171406)
[06:16:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-tur: Updated dependency on cg3 [debs/contenttranslation/apertium-tur] - https://gerrit.wikimedia.org/r/403351 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:17:11] <marostegui>	 !log Deploy schema change on s5 codfw master (db2052) with replication (this will generate lag on codfw) - T174569
[06:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:24] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:17:51] <wikibugs>	 (PS1) KartikMistry: apertium-urd: Updated dependency on cg3 [debs/contenttranslation/apertium-urd] - https://gerrit.wikimedia.org/r/403352 (https://phabricator.wikimedia.org/T171406)
[06:18:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-urd: Updated dependency on cg3 [debs/contenttranslation/apertium-urd] - https://gerrit.wikimedia.org/r/403352 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:19:49] <wikibugs>	 (PS1) KartikMistry: apertium-urd-hin: Updated dependency on cg3 [debs/contenttranslation/apertium-urd-hin] - https://gerrit.wikimedia.org/r/403353 (https://phabricator.wikimedia.org/T171406)
[06:20:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-urd-hin: Updated dependency on cg3 [debs/contenttranslation/apertium-urd-hin] - https://gerrit.wikimedia.org/r/403353 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:27:54] <wikibugs>	 (CR) Marostegui: [C: 1] wikireplicas: Add partial index for page_props.pp_value [puppet] - https://gerrit.wikimedia.org/r/388572 (https://phabricator.wikimedia.org/T140609) (owner: BryanDavis)
[06:36:49] <wikibugs>	 (PS1) Andrew Bogott: vmbuilder: include linux-image-generic in trusty base image [puppet] - https://gerrit.wikimedia.org/r/403355
[06:38:34] <wikibugs>	 (CR) Andrew Bogott: [C: 2] vmbuilder: include linux-image-generic in trusty base image [puppet] - https://gerrit.wikimedia.org/r/403355 (owner: Andrew Bogott)
[06:39:54] <wikibugs>	 Operations, Goal, Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3888967 (Marostegui) I believe we are good to close this task after Bryan finished with the pending Cloud Team's tasks?
[06:47:05] <wikibugs>	 Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3888968 (Andrew) I've build new base images, and I'm concerned about what I'm seeing for Jessie.  Trusty:   ``` andrew@trusty-meltdown-image...
[06:56:08] <wikibugs>	 Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3888980 (Andrew) Here are all the distros and kernels currently running:  P6565
[07:26:58] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807)
[07:37:08] <marostegui>	 !log Drop external_user from wikidatawiki - T184247
[07:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:21] <stashbot>	 T184247: Drop `external_user` from all databases - https://phabricator.wikimedia.org/T184247
[07:44:05] <moritzm>	 !log rebooting mw1262-mw1275 for kernel security update (along with update to HHVM 3.18.6)
[07:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:17] <wikibugs>	 (PS41) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[07:55:05] <wikibugs>	 Operations, Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889014 (Bawolff)
[08:13:44] <marostegui>	 !log Deploy schema change on s5 dbstore1002 - T174569
[08:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:57] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:28:13] <hashar>	 !log contint1001: upgraded Zuul 2.5.0-8-gcbc7f62-wmf4jessie1 .. 2.5.0-8-gcbc7f62-wmf6 | T158243
[08:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:26] <stashbot>	 T158243: Update zuul to upstream master - https://phabricator.wikimedia.org/T158243
[08:29:31] <hashar>	 there is a file on tin: modified:   /srv/mediawiki-staging/wikiversions.json
[08:29:37] <hashar>	 looking into it for marostegui 
[08:29:49] <marostegui>	 thanks hashar
[08:30:47] <hashar>	 ah that is twentyafterfour that did the deploy yesterday. T180749
[08:30:48] <stashbot>	 T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749
[08:33:49] <moritzm>	 !log rebooting mw1299-mw1306 (job runners) for kernel security update (along with update to HHVM 3.18.6)
[08:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:46] <wikibugs>	 (PS1) Hashar: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360
[08:36:07] <wikibugs>	 (CR) Hashar: [C: 2] group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:36:54] <hashar>	 marostegui: ^^that would fix it
[08:37:14] <hashar>	 group0 got updated but the wikiversions.json has been left uncommitted for some reason
[08:37:19] <marostegui>	 hashar: ah great :)
[08:37:36] <wikibugs>	 (Merged) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:37:48] <wikibugs>	 (CR) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:37:59] <hashar>	 marostegui: should be good now :]
[08:38:01] <marostegui>	 hashar: it is now gone - thanks! :)
[08:38:14] <marostegui>	 !log Deploy schema change on s5 dbstore1001 - T174569
[08:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:26] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:38:40] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:40:08] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:40:34] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:41:32] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 and db1089 - T162807 (duration: 01m 05s)
[08:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:45] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:42:41] <marostegui>	 !log Stop replication in sync on db1089 and db1067 - T162807
[08:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:25] <wikibugs>	 (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:53:33] <wikibugs>	 (CR) Hashar: "recheck" [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:57:52] <wikibugs>	 (CR) Hashar: "recheck" [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:59:16] <wikibugs>	 (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130
[09:01:19] <wikibugs>	 (PS3) Hashar: Jenkins job validation (DO NOT SUBMIT) (jessie) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130
[09:01:29] <wikibugs>	 Operations, ops-esams, Patch-For-Review, User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3889187 (fgiunchedi) Thanks a lot @Dzahn for taking care of this!
[09:02:48] <wikibugs>	 (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) (jessie) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[09:12:53] <moritzm>	 !log rebooting radium (tor relay) for kernel security update
[09:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:19] <marostegui>	 !log Deploy schema change on db1051 - T174569
[09:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:29] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:18:07] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:21:05] <wikibugs>	 Operations, Dumps-Generation: Reboot snapshot*, dumpsdata*, dataset1001, ms1001, francium - https://phabricator.wikimedia.org/T184443#3889205 (MoritzMuehlenhoff) Fixed kernels are available for trusty now, I've installed them on francium and snapshot100[1,5-7].
[09:21:07] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:27:26] <godog>	 !log stop restbase on cassandra 2 nodes - T184100
[09:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:38] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:32:02] <marostegui>	 !log Upgrade kernel on db1067
[09:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:21] <wikibugs>	 (CR) Ema: "LGTM in general, I've added a couple comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: Gilles)
[09:39:12] <ema>	 !log eqiad LVSs: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267
[09:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:26] <stashbot>	 T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[09:40:34] <moritzm>	 !log rebooting kubernetes workers (plus staging hosts) for kernel security update
[09:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "While the code is overall correct, I'm not convinced by its organization. I'd try to make the role/profile move first and rebase this chan" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/394966 (owner: Elukey)
[09:50:07] <godog>	 !log shut cassandra 2 on restbase legacy nodes - T184100
[09:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:19] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:51:23] <wikibugs>	 Operations, Discourse, Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3889259 (Qgil) p:Triage>Normal
[09:53:01] <wikibugs>	 Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3889275 (Qgil) @Tgr indeed: {T184592}
[10:00:20] <wikibugs>	 (PS1) Ladsgroup: statistics: Install php5-dom for wmde scripts [puppet] - https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463)
[10:02:59] <moritzm>	 !log rebooting tegmen for kernel security update
[10:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:45] <elukey>	 !log rebooting analytics1035 (hadoop worker node and hdfs journal node) for kernel updates
[10:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:22] <wikibugs>	 (PS1) Volans: Temporary failover Icinga to tegmen [puppet] - https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353)
[10:09:33] <wikibugs>	 (PS1) Volans: Temporary failover Icinga to tegmen [dns] - https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353)
[10:11:09] <wikibugs>	 Operations, ops-eqiad, Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3889328 (jcrespo) Please send us a 15 minute meeting invite, there are some things that we need to discuss regarding dbstores for you to talk to analytics and other dbstore users. T...
[10:12:42] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:13:05] <wikibugs>	 (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:13:49] <wikibugs>	 (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/403372
[10:14:00] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/403372 (owner: Hashar)
[10:14:21] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar)
[10:14:52] <wikibugs>	 (PS2) Filippo Giunchedi: decom legacy restbase/cassandra 2 cluster [puppet] - https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100)
[10:14:57] <wikibugs>	 (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/403373
[10:15:08] <wikibugs>	 (CR) Hashar: "check experimental" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/403373 (owner: Hashar)
[10:15:26] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/133435 (owner: Hashar)
[10:15:38] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] Temporary failover Icinga to tegmen [puppet] - https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[10:16:16] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/kafkatee] - https://gerrit.wikimedia.org/r/387571 (owner: Hashar)
[10:16:35] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/cdh] - https://gerrit.wikimedia.org/r/142517 (owner: Hashar)
[10:16:51] <wikibugs>	 (CR) Hashar: "check experimental" [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906 (owner: Hashar)
[10:16:55] <moritzm>	 !log rebooting bast4001 for kernel security update
[10:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:11] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar)
[10:17:24] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:17:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] decom legacy restbase/cassandra 2 cluster [puppet] - 10https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[10:17:52] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/131223 (owner: 10Hashar)
[10:18:00] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/133435 (owner: 10Hashar)
[10:18:10] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 (owner: 10Hashar)
[10:19:12] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar)
[10:19:15] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:19:31] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/142517 (owner: 10Hashar)
[10:21:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: site: spare::system vs system::spare [puppet] - 10https://gerrit.wikimedia.org/r/403376
[10:21:56] <wikibugs>	 (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:22:02] <wikibugs>	 (03PS5) 10Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436
[10:22:23] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:22:35] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/66906 (owner: 10Hashar)
[10:22:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] site: spare::system vs system::spare [puppet] - 10https://gerrit.wikimedia.org/r/403376 (owner: 10Filippo Giunchedi)
[10:22:49] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/133435 (owner: 10Hashar)
[10:22:59] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/131223 (owner: 10Hashar)
[10:23:03] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/133435 (owner: 10Hashar)
[10:23:06] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar)
[10:23:12] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/131223 (owner: 10Hashar)
[10:23:14] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/131222 (owner: 10Hashar)
[10:23:19] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar)
[10:23:22] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/403372 (owner: 10Hashar)
[10:23:29] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/403373 (owner: 10Hashar)
[10:23:40] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 (owner: 10Hashar)
[10:23:44] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/nginx] - 10https://gerrit.wikimedia.org/r/131222 (owner: 10Hashar)
[10:23:55] <wikibugs>	 (03CR) 10jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 (owner: 10Hashar)
[10:29:28] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:29:32] <wikibugs>	 (03PS5) 10Faidon Liambotis: rubocop: move three ignores to .rubocop.yml [puppet] - 10https://gerrit.wikimedia.org/r/359485
[10:29:40] <godog>	 !log reimage restbase1011 to test HBA mode - T184100
[10:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:53] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[10:30:00] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] rubocop: move three ignores to .rubocop.yml [puppet] - 10https://gerrit.wikimedia.org/r/359485 (owner: 10Faidon Liambotis)
[10:31:49] <wikibugs>	 (03PS6) 10Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436
[10:32:00] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:32:04] <wikibugs>	 (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/66906 (owner: 10Hashar)
[10:32:24] <wikibugs>	 (03CR) 10Addshore: [C: 031] statistics: Install php5-dom for wmde scripts [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: 10Ladsgroup)
[10:33:02] <wikibugs>	 (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/66906
[10:33:06] <wikibugs>	 (03CR) 10Addshore: [C: 04-1] "this should probably be within the statistics::wmde::graphite class? That is where the requirement for php actually comes in (via the scr" [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: 10Ladsgroup)
[10:33:16] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/64436 (owner: 10Hashar)
[10:33:31] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/66906 (owner: 10Hashar)
[10:36:56] <wikibugs>	 (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar)
[10:36:59] <wikibugs>	 (03PS4) 10Hashar: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434
[10:37:10] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar)
[10:38:29] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar)
[10:38:50] <wikibugs>	 (03PS1) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378
[10:40:24] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/403373 (owner: 10Hashar)
[10:41:56] <wikibugs>	 (03PS2) 10Ladsgroup: statistics: Install php5-dom for wmde scripts [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463)
[10:42:00] <wikibugs>	 (03CR) 10Ladsgroup: "Thanks. Fixed." [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: 10Ladsgroup)
[10:42:35] <wikibugs>	 (03PS1) 10Ema: pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380
[10:43:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/403378 (owner: 10Faidon Liambotis)
[10:49:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Temporary failover Icinga to tegmen [dns] - 10https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans)
[10:49:59] <wikibugs>	 (03PS2) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378
[10:52:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380 (owner: 10Ema)
[10:52:42] <volans>	 I'm about to failover the Icinga server to tegment (passive server) in about 5 minutes. If there is anything ongoing let me know and I can postpone it
[10:53:04] <volans>	 *tegmen ofc
[10:55:32] <elukey>	 !log reboot analytics1040->43 for kernel updates
[10:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:33] <elukey>	 volans: do I need to do anything about the maintenance that I've just scheduled for --^? (ignorant question)
[10:56:35] <wikibugs>	 (03CR) 10Ema: [C: 032] pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380 (owner: 10Ema)
[10:57:15] <volans>	 elukey: no, I will sync the files when failing over, but if you want I can wait for your 3 reboots
[10:57:53] <elukey>	 nono because I need to drain those hosts first, all good
[10:58:19] <elukey>	 thanks :)
[10:58:38] <icinga-wm>	 PROBLEM - Host wtp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:58:44] <volans>	 anyway it's a good test for the procedure, I'll check that the downtime is still there
[10:59:08] <icinga-wm>	 PROBLEM - Host wtp2002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:59:18] <icinga-wm>	 RECOVERY - Host wtp2002 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms
[10:59:19] <icinga-wm>	 RECOVERY - Host wtp2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[11:04:09] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[11:05:09] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time
[11:06:17] <wikibugs>	 (03CR) 10Volans: [C: 032] Temporary failover Icinga to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans)
[11:06:22] <wikibugs>	 (03PS2) 10Volans: Temporary failover Icinga to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353)
[11:07:57] <volans>	 !log start failovering of Icinga to tegmen - T170353
[11:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:09] <stashbot>	 T170353: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353
[11:10:36] <wikibugs>	 (03CR) 10Volans: [C: 032] Temporary failover Icinga to tegmen [dns] - 10https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans)
[11:11:52] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3889432 (10jcrespo) So actually, that is not really that bad- query times are similar (only some small overhead), connection t...
[11:12:29] <moritzm>	 !log migrating instances off ganeti2008 for subsequent reboot for kernel security update
[11:12:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:34] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3889438 (10jcrespo) One thing I just realized is that there could be some connection overhead on db1055- I will (or you can) t...
[11:18:58] <icinga-wm>	 PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[11:19:19] <ema>	 that's me ^
[11:19:31] <volans>	 !log Icinga failover to tegmen completed - T170353
[11:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:45] <stashbot>	 T170353: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353
[11:19:48] <icinga-wm>	 RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms
[11:19:50] <volans>	 the ACTIVE Icinga server is now tegmen
[11:22:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569)
[11:23:28] <icinga-wm>	 PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - T170353 - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[11:23:48] <moritzm>	 !log migrating instances off ganeti2007 for subsequent reboot for kernel security update
[11:23:50] <volans>	 akosiaris: there you go! ampersands are there :D ^^^
[11:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:08] <icinga-wm>	 PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[11:24:39] <icinga-wm>	 RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms
[11:24:47] <ema>	 still me, sorry about that (I've downtimed the hosts on the wrong icinga server hehe) ^
[11:25:14] <volans>	 TTL? :D
[11:26:04] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3889454 (10Volans) Confirmed that on `tegmen` it works fine after failovering the active Icinga server to it. The links are properly rendered and...
[11:26:19] <elukey>	 !log reboot analytics1044->47 for kernel updates
[11:26:22] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3889455 (10jcrespo) 05Open>03Resolved a:03jcrespo yes, but let's open one for followup/clean up - delete, which we will want to wait to do (leave data there for a few...
[11:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:42] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[11:32:14] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[11:32:28] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[11:33:03] <wikibugs>	 (03PS2) 10Elukey: Standardize Analytics jmx agent's configurations [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458)
[11:33:08] <icinga-wm>	 RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[11:33:33] <marostegui>	 !log Deploy schema change on db1106 - T174569
[11:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:44] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[11:38:33] <akosiaris>	 volans: I am speechless and have no idea
[11:38:45] <akosiaris>	 I suppose you already did a diff of the configs?
[11:39:21] <volans>	 yes, IIRC I did diff the whole /etc/icinga, don't remember if I also did /etc/nagios, I can redo both
[11:39:33] <volans>	 or whole /etc :D
[11:39:55] <moritzm>	 !log rebooting mw1201-mw1208 for kernel security update (along with update to HHVM 3.18.6)
[11:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:18] <icinga-wm>	 PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:41:18] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:18] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:18] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:18] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:19] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:19] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:20] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused
[11:41:20] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused
[11:41:21] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:21] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.130:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:22] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:22] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:23] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused
[11:41:34] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.177 and port 9042: Connection refused
[11:41:34] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused
[11:41:35] <icinga-wm>	 PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:41:35] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:41:36] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.143:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:36] <icinga-wm>	 PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:41:37] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:41:37] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:38] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[11:41:38] <icinga-wm>	 PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.205 and port 9042: Connection refused
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.139:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused
[11:41:41] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[11:41:42] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.205:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:42] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.130:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.130 and port 9042: Connection refused
[11:41:54] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.143 and port 9042: Connection refused
[11:41:54] <icinga-wm>	 PROBLEM - Restbase root url on restbase2007 is CRITICAL: connect to address 10.192.16.175 and port 7231: Connection refused
[11:41:55] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[11:41:55] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:56] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:56] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:41:58] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:41:58] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused
[11:41:58] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused
[11:41:58] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[11:41:59] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:41:59] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:42:00] <icinga-wm>	 PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:42:00] <icinga-wm>	 PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:42:01] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:42:01] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[11:42:02] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused
[11:42:02] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[11:42:18] <icinga-wm>	 PROBLEM - Restbase root url on restbase2011 is CRITICAL: connect to address 10.192.32.151 and port 7231: Connection refused
[11:42:18] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused
[11:42:18] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[11:42:19] <icinga-wm>	 PROBLEM - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused
[11:42:36] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243)
[11:42:59] <wikibugs>	 (03CR) 10Aklapper: [C: 04-1] "Please see https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines and fix the commit message format (imperative form; length of l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403120 (owner: 10محمد شعیب)
[11:45:05] <godog>	 !log downtime decommissioned restbase cassandra 2 hosts
[11:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:08] <icinga-wm>	 PROBLEM - Host wtp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:48:19] <icinga-wm>	 RECOVERY - Host wtp2001 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms
[11:51:21] <moritzm>	 !log migrating instances off ganeti2006 for subsequent reboot for kernel security update
[11:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:31] <wikibugs>	 (03CR) 10Elukey: [C: 032] "After a chat with Gehel I decided to proceed anyway since I don't have a ton of mbeans to inspect in my jvms. We started to collect info a" [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey)
[12:00:24] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[12:01:36] <wikibugs>	 10Operations, 10Patch-For-Review: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3816772 (10faidon) @Dzahn, yes, that sounds like a good idea. Please do :)
[12:01:54] <icinga-wm>	 PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid]
[12:02:11] <moritzm>	 ^acrux is transient
[12:02:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100)
[12:03:37] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[12:04:14] <icinga-wm>	 PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:04:26] <_joe_>	 uh ganeti down
[12:04:44] <_joe_>	 ah see log by moritz, ok
[12:05:08] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[12:05:14] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[12:05:14] <icinga-wm>	 RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[12:06:12] <moritzm>	 yeah, for some reason my downtime had vanished
[12:06:51] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[12:07:45] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[12:08:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[12:10:07] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390
[12:11:08] <moritzm>	 !log rebooting einsteinium for kernel security update
[12:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:00] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[12:12:55] <icinga-wm>	 PROBLEM - Host wtp2007 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:05] <icinga-wm>	 PROBLEM - Host wtp2004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:24] <icinga-wm>	 RECOVERY - Host wtp2007 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[12:13:34] <icinga-wm>	 RECOVERY - Host wtp2004 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[12:13:43] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] "Added in deployments list for European Mid-day SWAT today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[12:17:42] <wikibugs>	 (03CR) 10TerraCodes: "You forgot the other two patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[12:19:26] <moritzm>	 !log migrating instances off ganeti2005 for subsequent reboot for kernel security update
[12:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:15] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] "> You forgot the other two patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[12:20:45] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[12:21:30] <wikibugs>	 (03CR) 10TerraCodes: "> > You forgot the other two patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[12:23:00] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] "> > > You forgot the other two patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[12:31:52] <icinga-wm>	 RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:37:52] <icinga-wm>	 PROBLEM - Host wtp2012 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:12] <icinga-wm>	 PROBLEM - Host wtp2013 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:42] <icinga-wm>	 RECOVERY - Host wtp2012 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[12:38:52] <moritzm>	 !log migrating instances off ganeti2004 for subsequent reboot for kernel security update
[12:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:00] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390
[12:42:28] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui)
[12:44:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui)
[12:46:04] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T174569 (duration: 01m 03s)
[12:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:15] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[12:47:04] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui)
[12:47:31] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569)
[12:50:21] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[12:52:12] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[12:53:21] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[12:53:34] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[12:54:00] <marostegui>	 !log Deploy schema change on db1097:3315 - https://phabricator.wikimedia.org/T174569
[12:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:44] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T174569 (duration: 01m 03s)
[12:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:56] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[12:55:37] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@a2aabfb]: API: add top-by-country, change recommendation route, fix duplicates in onthisday - T181520 T170877 T175974
[12:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:49] <stashbot>	 T175974: [BUG] On this day occasionally duplicates events - https://phabricator.wikimedia.org/T175974
[12:55:49] <stashbot>	 T181520: Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520
[12:55:49] <stashbot>	 T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877
[12:55:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "This works correctly in production, as seen here https://puppet-compiler.wmflabs.org/compiler02/9684/ but I still need to fix labs before " [puppet] - 10https://gerrit.wikimedia.org/r/403388 (owner: 10Giuseppe Lavagetto)
[13:03:15] <icinga-wm>	 PROBLEM - Host wtp2010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:03:15] <icinga-wm>	 PROBLEM - Host wtp2017 is DOWN: PING CRITICAL - Packet loss = 100%
[13:03:37] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@a2aabfb]: API: add top-by-country, change recommendation route, fix duplicates in onthisday - T181520 T170877 T175974 (duration: 08m 00s)
[13:03:44] <icinga-wm>	 RECOVERY - Host wtp2017 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[13:03:44] <icinga-wm>	 RECOVERY - Host wtp2010 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[13:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:51] <stashbot>	 T175974: [BUG] On this day occasionally duplicates events - https://phabricator.wikimedia.org/T175974
[13:03:51] <stashbot>	 T181520: Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520
[13:03:51] <stashbot>	 T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877
[13:08:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[13:15:03] <_joe_>	 wat?
[13:15:08] <_joe_>	 ema: ^^
[13:21:21] <_joe_>	 cannot reproduce it ftr
[13:22:31] <_joe_>	 oh now I can
[13:25:23] <Niharika>	 jouncebot: next
[13:25:24] <jouncebot>	 In 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400)
[13:26:01] <_joe_>	 !log restarting pybal on lvs2003
[13:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[13:28:15] <icinga-wm>	 PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100%
[13:28:44] <icinga-wm>	 RECOVERY - Host wtp2019 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:31:54] <akosiaris>	 that's me ^
[13:32:21] <akosiaris>	 downtime expired
[13:34:13] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:32] <godog>	 that's me, reimaged machine
[13:37:06] <moritzm>	 !log migrating instances off ganeti2003 for subsequent reboot for kernel security update
[13:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:33] <icinga-wm>	 PROBLEM - Host elastic2008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:43:16] <wikibugs>	 (03PS2) 10MarcoAurelio: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314)
[13:44:04] <icinga-wm>	 RECOVERY - Host elastic2008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:44:16] <jynus>	 I do not see the elastic ones on SAL
[13:44:52] <jynus>	 I assume it is part of yesterdays work?
[13:45:17] <jynus>	 *work started yesterday
[13:45:22] <dcausse>	 jynus: yes rolling restarts usually take 2/3 days
[13:45:28] <jynus>	 thanks
[13:45:32] <jynus>	 not complaining
[13:45:42] <jynus>	 just wanted to make sure it wasn't a crash
[13:45:48] <dcausse>	 sure, np
[13:46:44] <icinga-wm>	 PROBLEM - Host elastic2007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:37] <gehel>	 jynus: thanks for the check! I'm checking why those were not downtimed correctly by my script...
[13:48:04] <icinga-wm>	 RECOVERY - Host elastic2007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:48:46] <Hauskatze>	 hi brion - if https://gerrit.wikimedia.org/r/#/c/401965/ looks good to you now, can you remove your -2?
[13:50:15] <gehel>	 strange, I do have the set downtime in my logs...
[13:50:28] <volans>	 gehel: how do you downtime them?
[13:51:11] <gehel>	 volans: icinga-downtime on einsteinium
[13:51:21] <volans>	 eheheh
[13:51:30] <volans>	 we failovered to tegmen today (temporarily)
[13:51:45] <gehel>	 Ah, I missed that one. That explains!
[13:51:51] <volans>	 you should use icinga.w.o that is ofc updated
[13:52:03] <gehel>	 volans: thanks! 
[13:52:04] <volans>	 I'm sorry for the trouble, any way I can help?
[13:52:58] <gehel>	 but SSH to icinga.w.o isn't possible... or am I missing something?
[13:53:20] <volans>	 from your local computer?
[13:53:24] <gehel>	 yep
[13:53:52] <volans>	 yes it is if using my script that generates the right entries in the known hosts file ;)
[13:54:13] <icinga-wm>	 PROBLEM - Host wtp2018 is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:17] <gehel>	 I should of course migrate those ugly scripts to a proper cumin tool :)
[13:54:24] <icinga-wm>	 PROBLEM - Host wtp2015 is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:25] <volans>	 to use the host strict check 
[13:54:44] <icinga-wm>	 RECOVERY - Host wtp2015 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[13:54:50] <volans>	 gehel: indeed, but I guess I'm also a blocker on that for the switchdc spinoff ;)
[13:54:52] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183896#3889763 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Thanks @Cmjohnson ! Disk rebuilding
[13:54:53] <icinga-wm>	 RECOVERY - Host wtp2018 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:55:38] <gehel>	 volans: yep you are :) (but I have plenty of other excuses for not moving forward on that, don't blame yourself)
[13:56:23] <volans>	 thanks for sharing the blame :-P
[13:57:07] <volans>	 but I'm a blocker for *any* of those, so it's fair I get a bigger share of the blame ;)
[13:57:09] <wikibugs>	 (03PS1) 10Legoktm: contint: Lower caching length on doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/403401 (https://phabricator.wikimedia.org/T184255)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400).
[14:00:04] <jouncebot>	 Jayprakash12345, Zoranzoki21, and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:13] <Hauskatze>	 o/
[14:00:16] <Hauskatze>	 I'm here
[14:00:18] <zeljkof>	 o/
[14:00:33] <zeljkof>	 I can SWAT today
[14:02:04] <volans>	 gehel: FYI we'll get back to einsteinium by EOD most likely (or tomorrow morning at most)
[14:02:23] <gehel>	 volans: thanks! I'll add a check...
[14:02:45] <Hauskatze>	 zeljkof: if the others ain't around we maybe can start with mine?
[14:03:15] <zeljkof>	 Hauskatze: I'll deploy the 403342 first, since there is nothing to test there
[14:03:29] <Hauskatze>	 cool, k
[14:03:51] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345)
[14:05:36] <Zoranzoki21>	 Hi, I am here. Has SWAT started?
[14:05:56] <moritzm>	 !log migrating instances off ganeti2002 for subsequent reboot for kernel security update
[14:06:07] <Hauskatze>	 Zoranzoki21: yes, you're next
[14:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:19] <Zoranzoki21>	 OK, I am here
[14:06:25] <wikibugs>	 (03Merged) 10jenkins-bot: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345)
[14:06:36] <Hauskatze>	 zeljkof: Zoranzoki21 is here now :)
[14:06:42] <wikibugs>	 (03CR) 10jenkins-bot: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345)
[14:07:13] <Zoranzoki21>	 Oh, I forgot the name of the extension for checking
[14:07:57] <Zoranzoki21>	 OK, I found it and installed. I am now here
[14:08:18] <Jayprakash12345>	 Zfilipin: Thank you for merge
[14:08:31] <wikibugs>	 (03PS2) 10Rush: tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722)
[14:08:57] <zeljkof>	 Jayprakash12345: deploying it right now
[14:09:20] <Zoranzoki21>	 zeljkof: I am next?
[14:09:52] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:403342|Lift the cap on IP address to create accounts on mrwiki (T184579)]] (duration: 01m 04s)
[14:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:06] <stashbot>	 T184579: Request to lift the cap on IP address to create accounts on wiki - https://phabricator.wikimedia.org/T184579
[14:10:10] <zeljkof>	 Jayprakash12345: 403342 is deployed
[14:10:31] <zeljkof>	 Zoranzoki21: you are next, but I do not feel comfortable deploying your changes :(
[14:10:42] <Jayprakash12345>	 Zfilipin: Thank you very much.
[14:10:53] <Zoranzoki21>	 zeljkof: I can test. I have x-wikimedia debug
[14:10:57] <zeljkof>	 there is a good chance something will go wrong, and I am not familiar with the variables
[14:11:10] <wikibugs>	 (03PS3) 10Rush: tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722)
[14:11:32] <Zoranzoki21>	 zeljkof: If any reproduce problem, you can rollback patch
[14:11:42] <Zoranzoki21>	 zeljkof: I think to all will be ok, without problems
[14:12:34] <zeljkof>	 Zoranzoki21: since Hauskatze has only one patch, and it's simpler, I will deploy it first, and then look at your patches
[14:12:47] <Zoranzoki21>	 zeljkof: Ok
[14:13:14] <Hauskatze>	 fine for me
[14:13:24] <Hauskatze>	 let me know when you're ready and to test
[14:13:27] <Hauskatze>	 ty
[14:13:31] <zeljkof>	 Zoranzoki21: the problem is that I did not see reviews from anybody that is familiar with the code on the patches
[14:14:12] <wikibugs>	 (03CR) 10Legoktm: [C: 031] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn)
[14:14:47] <Zoranzoki21>	 zeljkof: Ok
[14:14:55] <zeljkof>	 Zoranzoki21: in order for me to merge the patches, try getting reviews from for example hasharAway, no_justification, Dereckson, anomie...
[14:15:22] <Zoranzoki21>	 They are already reviewers in patch
[14:15:28] <Zoranzoki21>	 But they no respond
[14:15:32] <zeljkof>	 Zoranzoki21: as it stands at the moment, I do not feel comfortable deploying such changes
[14:15:44] <zeljkof>	 Zoranzoki21: they might be reviewers, but did not provide any feedback
[14:15:45] <zeljkof>	 right?
[14:16:02] <Zoranzoki21>	 zeljkof: They are reviewers, but did not provide feedback
[14:16:07] <zeljkof>	 so...
[14:16:33] <Zoranzoki21>	 zeljkof: But, I no know why. They have to, if any is not ok, to tell it
[14:16:41] <zeljkof>	 do you get my point? until somebody from the phab ticket says the patches look good (silence is not approval), I will not deploy them
[14:17:20] <zeljkof>	 Zoranzoki21: people are busy, you have to make sure you get at least one positive review, preferably more
[14:17:27] <Zoranzoki21>	 zeljkof: Ok
[14:17:44] <zeljkof>	 Zoranzoki21: I do not want to earn "I broke wikipedia" t-shirt
[14:17:48] <zeljkof>	 not yet
[14:18:09] <zeljkof>	 Hauskatze: reviewing your commit
[14:18:22] <Zoranzoki21>	 zeljkof: Ok. If patches get positive review(s) I will add for next swat which come in it time
[14:18:30] <Zoranzoki21>	 zeljkof: Is it ok?
[14:18:30] <zeljkof>	 Zoranzoki21: please do
[14:18:34] <zeljkof>	 yes
[14:18:43] <Zoranzoki21>	 zeljkof: OK thank you
[14:18:44] <zeljkof>	 sorry for being careful, but it's my job not to break stuff :)
[14:18:57] <Zoranzoki21>	 zeljkof: OK, no problems. I know it
[14:19:20] <icinga-wm>	 PROBLEM - Host wtp2003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:20] <icinga-wm>	 PROBLEM - Host wtp2006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:47] <Jayprakash12345>	 Zoranzoki21: I suggest you contact a senior member before deploying.
[14:19:59] <icinga-wm>	 RECOVERY - Host wtp2003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[14:19:59] <icinga-wm>	 RECOVERY - Host wtp2006 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[14:20:18] <Zoranzoki21>	 Jayprakash12345: OK
[14:20:28] <Jayprakash12345>	 Zoranzoki21: Changing in global Variable is very harmful.
[14:20:40] <Zoranzoki21>	 Jayprakash12345: Ok, I know. I already told any
[14:20:49] <zeljkof>	 Hauskatze: your patch is also bigger than I like for swat :)
[14:21:00] <zeljkof>	 can you test it at mwdebug1002?
[14:21:08] <Hauskatze>	 zeljkof: yes
[14:21:09] <zeljkof>	 what's the chance of things breaking?
[14:21:38] <Hauskatze>	 zeljkof: minimal, as it can be tested on mwdebug and Special:ListGroupRights. If the rights don't appear, we can revert
[14:21:39] <zeljkof>	 how much time do you need to test it? it touches many wikis, right?
[14:21:54] <Hauskatze>	 I'll do random checks on 3/4 wikis
[14:21:58] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio)
[14:22:18] <zeljkof>	 Hauskatze: ok, merging, will ping you when at mwdebug
[14:22:21] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3889821 (10MoritzMuehlenhoff) >>! In T184189#3888968, @Andrew wrote: > Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+...
[14:22:26] <Hauskatze>	 the config is already on CommonSettings after all since a week or so
[14:22:38] <Hauskatze>	 okay let me know :)
[14:23:27] <wikibugs>	 (03Merged) 10jenkins-bot: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio)
[14:23:41] <wikibugs>	 (03CR) 10jenkins-bot: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio)
[14:26:13] <zeljkof>	 Hauskatze: 402780 is at mwdebug1002
[14:26:20] <Hauskatze>	 ack, checking
[14:27:44] <Hauskatze>	 checks successful so far, I'll do some more zeljkof 
[14:28:46] <zeljkof>	 ok
[14:29:23] <Hauskatze>	 zeljkof: revert, I missed a line for commons
[14:29:40] <Hauskatze>	 or I can amend it really quick
[14:29:55] <Hauskatze>	 because the change is working after all
[14:30:09] <icinga-wm>	 PROBLEM - NTP on sca2003 is CRITICAL: NTP CRITICAL: Offset unknown
[14:30:11] <zeljkof>	 Hauskatze: if you can create another commit that fixes the problem, I can deploy both at the same time
[14:30:27] <zeljkof>	 we have 30 more minutes in the window
[14:31:15] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] "Hi Zach. This is not deployed, because noone is not reviewed this patch except me. When any another review this, in next swat time, this s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[14:31:41] <Hauskatze>	 zeljkof: it's https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/CommonSettings.php;1e39c531186ca225cd7eb1efe5e059203ed366e2$2507
[14:31:49] <Hauskatze>	 change To From
[14:32:00] <Hauskatze>	 so let's deploy and I can amend that commonsettings thing
[14:32:29] <zeljkof>	 Hauskatze: wait, I did not understand
[14:32:40] <zeljkof>	 I can deploy the 402780?
[14:32:55] <zeljkof>	 and you will create follow up commit that fixes some problem?
[14:32:57] <Hauskatze>	 zeljkof: the patch is good and works as expected, yes; however there's a typo in CommonSettings
[14:33:13] <Hauskatze>	 and I'm creating the follow-up right now
[14:33:22] <zeljkof>	 ok, so I should deploy 402780? or wait for the follow-up?
[14:33:38] <Hauskatze>	 zeljkof: what's best to do, CS and later IS or vice-versa?
[14:33:41] <Hauskatze>	 cc Urbanecm 
[14:34:08] <zeljkof>	 CS? IS?
[14:34:41] <jynus>	 !log dropping wikidatawiki from dbstore2001:3315 T184599
[14:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:53] <stashbot>	 T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599
[14:35:08] <Platonides>	 Hauskatze: IS?
[14:35:27] <Hauskatze>	 CommonSettings and InitialiseSettings
[14:35:36] <wikibugs>	 (03CR) 10Rush: tools: ferm pre hook to stop kube-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush)
[14:35:43] <Platonides>	 I'm confused
[14:35:53] <Platonides>	 why not make both changes in the same changeset?
[14:36:01] <wikibugs>	 (03Draft1) 10MarcoAurelio: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410
[14:36:03] <wikibugs>	 (03PS2) 10MarcoAurelio: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410
[14:36:10] <Hauskatze>	 there it is^
[14:36:27] <Hauskatze>	 Platonides: fact is that you should sync one first
[14:37:07] <Hauskatze>	 https://gerrit.wikimedia.org/r/403410 is good to go, it's a typo fix
[14:37:37] <wikibugs>	 (03CR) 10Platonides: [C: 031] "This is indeed the right variable name" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio)
[14:38:18] <wikibugs>	 10Operations, 10Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889838 (10Bawolff)
[14:39:18] <Hauskatze>	 zeljkof: it's done
[14:39:37] <Hauskatze>	 Platonides: wrt. same patchset, I can't given that one is already merged
[14:39:39] <zeljkof>	 Hauskatze: I'm confused, 402780 changes only IS, 403410 changes only CS?
[14:39:57] <Hauskatze>	 zeljkof: CS typo prevented IS patch to fully work as expected
[14:40:10] <zeljkof>	 ok
[14:40:19] <Hauskatze>	 only with regards to administrators not being able to remove the permission from themselves
[14:40:24] <zeljkof>	 in which order should I deploy the files?
[14:40:34] <zeljkof>	 IS, then CS? vice-versa?
[14:40:37] <Hauskatze>	 CS then IS I'd say
[14:40:44] <zeljkof>	 ok
[14:41:00] <Hauskatze>	 if the order is vice-versa, we can re-scap
[14:41:07] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio)
[14:42:36] <wikibugs>	 (03Merged) 10jenkins-bot: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio)
[14:42:50] <wikibugs>	 (03CR) 10jenkins-bot: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio)
[14:42:53] <chasemp>	 !log new meltdown images are live in cloud land
[14:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:31] <zeljkof>	 Hauskatze: 403410 is at mwdebug1002, please confirm that things now work fine before the deployment
[14:43:41] <Hauskatze>	 ack
[14:43:50] <icinga-wm>	 PROBLEM - Host wtp2009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:43:50] <icinga-wm>	 PROBLEM - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:44] <Hauskatze>	 zeljkof: it does now
[14:44:49] <icinga-wm>	 RECOVERY - Host wtp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[14:44:54] <zeljkof>	 Hauskatze: ok to deploy?
[14:44:59] <icinga-wm>	 RECOVERY - Host wtp2016 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[14:45:01] <Hauskatze>	 yes from me
[14:45:07] <zeljkof>	 ok, deploying...
[14:45:12] <wikibugs>	 (03CR) 10Rush: [C: 032] tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush)
[14:45:42] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889866 (10BBlack) That looks about right (disable all hashes older than SHA256, disable RSA+DSA), although it's hard to suss exactly what th...
[14:46:28] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:403410|translationadmin: typo fix]] (duration: 01m 03s)
[14:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:46] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:402780|translationadmin: remove configuration equal to CommonSettings.php (T184314)]] (duration: 01m 02s)
[14:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:00] <stashbot>	 T184314: Redundant wmf-config for translationadmin - https://phabricator.wikimedia.org/T184314
[14:48:05] <ema>	 _joe_: mmh what was the deal with lvs2003?
[14:48:10] <zeljkof>	 Hauskatze: all deployed, please check and thanks for deploying with #releng ;)
[14:48:42] <_joe_>	 ema: some internal error in the http part after it tried to remove an alert
[14:48:44] <Hauskatze>	 I'm checking and no issues so far
[14:48:51] <Hauskatze>	 thanks for deploying for me
[14:49:10] <zeljkof>	 Hauskatze: no problem, please add the second commit to the calendar
[14:49:27] <zeljkof>	 logs look fine so far...
[14:49:37] <Hauskatze>	 zeljkof: sure, almost forgot
[14:50:04] <jynus>	 !log dropping dewiki from dbstore2001:3318 T184599
[14:50:07] <zeljkof>	 !log EU SWAT finished
[14:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:17] <stashbot>	 T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599
[14:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:47] <Hauskatze>	 done that
[14:51:48] <godog>	 !log start cassandra-a on restbase1011 - T184100
[14:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:58] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[14:53:08] <wikibugs>	 (03PS1) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411
[14:53:33] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889872 (10Ottomata) > Does that mean SHA1 is disabled, except in the cases that it's the root cert of a chain stored in the jdkCA's default...
[14:54:22] <Hauskatze>	 zeljkof: added to calendar and also marked as not done some https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400
[14:54:29] <ema>	 !log codfw LVSs: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267
[14:54:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:43] <stashbot>	 T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[14:55:27] <wikibugs>	 (03PS2) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411
[14:56:45] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388
[14:58:17] <wikibugs>	 (03PS3) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411
[14:58:49] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889885 (10BBlack) Yeah, seems reasonable to just set it system-wide on these systems.
[14:59:06] <wikibugs>	 (03CR) 10Rush: [C: 032] tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411 (owner: 10Rush)
[15:02:54] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3889896 (10Andrew) The load-testing command I've settled on is:   ``` sudo cumin --force --timeout 120 -o json "project:testlabs name:labvirt1...
[15:05:21] <zeljkof>	 Hauskatze: thanks
[15:06:20] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused
[15:06:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:06:29] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:07:56] <wikibugs>	 (03CR) 10Ema: [C: 031] contint: Lower caching length on doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/403401 (https://phabricator.wikimedia.org/T184255) (owner: 10Legoktm)
[15:08:05] <godog>	 that's not expected, I'll take a look
[15:09:20] <icinga-wm>	 PROBLEM - Host wtp2008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:30] <icinga-wm>	 PROBLEM - Host wtp2005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:49] <icinga-wm>	 RECOVERY - Host wtp2008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:10:10] <icinga-wm>	 RECOVERY - Host wtp2005 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[15:12:36] <wikibugs>	 (03CR) 10Jayprakash12345: [C: 04-1] "no consensus link at task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403120 (owner: 10محمد شعیب)
[15:13:14] <wikibugs>	 (03PS1) 10Ottomata: Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993)
[15:13:29] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active
[15:14:18] <moritzm>	 !log reboot netmon1002 / netmon2001 for kernel security update
[15:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:45] <akosiaris>	 volans: seems like https://gerrit.wikimedia.org/r/#/c/400250/ is the culprit
[15:16:30] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9686/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[15:16:34] <volans>	 akosiaris: for the  ensure => 'present',?
[15:16:53] <akosiaris>	 yes
[15:17:01] <akosiaris>	 and I missed it in the review
[15:17:17] <volans>	 yeah in the yaml file we switch the role::tcpircbot::ensure
[15:17:30] <volans>	 thanks for looking into ti
[15:17:32] <volans>	 *it
[15:21:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2018-08-17 16:11:12 +0000 (expires in 219 days)
[15:22:29] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.037 second response time on 10.64.32.202 port 9042
[15:23:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix role::tcpircbot lookups for tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/403417
[15:24:47] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418
[15:24:52] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418
[15:30:06] <icinga-wm>	 RECOVERY - NTP on sca2003 is OK: NTP OK: Offset 9.244680405e-05 secs
[15:32:40] <moritzm>	 !log rebooting yubico auth servers for kernel security update
[15:32:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:14] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui)
[15:34:26] <icinga-wm>	 PROBLEM - Host wtp2011 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:37] <icinga-wm>	 PROBLEM - Host wtp2014 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:56] <icinga-wm>	 RECOVERY - Host wtp2011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[15:34:57] <icinga-wm>	 PROBLEM - MD RAID on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:34:57] <icinga-wm>	 PROBLEM - dhclient process on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:34:57] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:06] <icinga-wm>	 PROBLEM - Check size of conntrack table on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:06] <icinga-wm>	 RECOVERY - Host wtp2014 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[15:35:06] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:16] <icinga-wm>	 PROBLEM - Disk space on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:16] <icinga-wm>	 PROBLEM - configured eth on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:16] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:35:17] <icinga-wm>	 PROBLEM - Check systemd state on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:36] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[15:35:37] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused
[15:35:37] <icinga-wm>	 PROBLEM - DPKG on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:40] <Amir1>	 hey, can this patch be merged? https://gerrit.wikimedia.org/r/#/c/403366 it's tiny
[15:35:46] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:35:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:47] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:47] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:35:54] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui)
[15:35:57] <Amir1>	 if it's not possible, let me know to put it in the puppet SWAT
[15:36:01] <godog>	 ugh, sorry about the spam
[15:36:06] <icinga-wm>	 PROBLEM - puppet last run on restbase1011 is CRITICAL: Return code of 255 is out of bounds
[15:36:46] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui)
[15:37:09] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T174569 (duration: 01m 03s)
[15:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:21] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[15:47:27] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3890015 (10Cmjohnson)
[15:47:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3890014 (10Cmjohnson) 05Open>03Resolved
[15:47:48] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2713121 (10Cmjohnson)
[15:47:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Phabricator, 10hardware-requests: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3890016 (10Cmjohnson) 05Open>03Resolved
[15:48:04] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2714212 (10Cmjohnson)
[15:48:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3890019 (10Cmjohnson) 05Open>03Resolved
[15:48:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3890021 (10Cmjohnson) 05Open>03Resolved
[15:48:22] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2714241 (10Cmjohnson)
[15:48:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3890023 (10Cmjohnson) 05Open>03Resolved
[15:48:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3890024 (10Cmjohnson) 05Open>03Resolved
[15:49:08] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2767459 (10Cmjohnson)
[15:49:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3890025 (10Cmjohnson) 05Open>03Resolved
[15:49:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3890027 (10Cmjohnson) 05Open>03Resolved
[15:50:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3890028 (10Cmjohnson) @marostegui I have a used spare battery we can swap this out with.  LMK when you want to schedule this
[15:50:33] <wikibugs>	 (03PS13) 10Giuseppe Lavagetto: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey)
[15:52:01] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3890043 (10Marostegui) @Cmjohnson you want me to power off the server and we can do it now?
[15:53:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Fix role::tcpircbot lookups for tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/403417 (owner: 10Alexandros Kosiaris)
[15:53:32] <cmjohnson1>	 @marostegui: no, not right now. Can we do later this afternoon or tomorrow morning? 
[15:53:58] <marostegui>	 cmjohnson1: tomorrow morning works for me :)
[15:54:23] <cmjohnson1>	 cool! I will ping you tomorrow 
[15:54:27] <marostegui>	 cool thanks!
[15:54:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3890063 (10Marostegui) As per our chat, this will be done tomorrow
[15:56:02] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: Decommission mw1180-1200 - https://phabricator.wikimedia.org/T183895#3890070 (10Cmjohnson)
[15:56:16] <wikibugs>	 (03PS3) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378
[15:56:20] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378 (owner: 10Faidon Liambotis)
[15:57:23] <paravoid>	 JENKINS!
[15:57:26] <paravoid>	 wake up!
[15:57:37] <akosiaris>	 the fact we got an icinga bot that is called ircecho, but an effectively echoing bot called tcpircbot ...
[15:57:59] <godog>	 akosiaris: yeah that's endlessly confusing
[15:58:20] <godog>	 the whole multitude of irc bots slightly different but equal that is
[15:58:38] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569)
[15:59:38] <wikibugs>	 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3890096 (10ArielGlenn) I'm adding @Nikerabbit, @demon and @hoo because they will be the main beneficiaries of this new host.  How do you see...
[15:59:41] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[15:59:45] <wikibugs>	 (03PS2) 10Ottomata: Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993)
[15:59:47] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata)
[15:59:57] <godog>	 !log start cassandra-a on restbase1011
[16:00:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:13] <wikibugs>	 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3890099 (10bmansurov) Also blocked on a final review by @DarTar and project owners.
[16:00:23] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[16:01:26] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890108 (10Andrew) I've run three load tests with the above command.  The last test started at Wed Jan 10 15:51:10 UTC 2018  {F12387667}  {F12...
[16:01:44] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[16:01:59] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[16:02:43] <wikibugs>	 10Operations, 10Datasets-General-or-Unknown: Replace snapshot1001 with a proper testbed host (new hardware) - https://phabricator.wikimedia.org/T184616#3890113 (10ArielGlenn) p:05Triage>03Normal
[16:02:58] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T174569 (duration: 01m 02s)
[16:03:07] <marostegui>	 !log Deploy schema change on db1095.s5 - https://phabricator.wikimedia.org/T174569
[16:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:10] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[16:03:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:14] <moritzm>	 !log switched ganeti master node in codfw to ganeti2004
[16:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:18] <moritzm>	 !log migrating instances off ganeti2001 for subsequent reboot for kernel security update
[16:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:36] <godog>	 !log roll-restart swift frontend in eqiad for kernel upgrade
[16:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:25] <anomie>	 marostegui: Should I wait on T181731 s5 or can I go ahead? I think the only real risk is if it breaks replication again.
[16:14:25] <stashbot>	 T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731
[16:15:02] <icinga-wm>	 PROBLEM - DPKG on ms-fe1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:15:55] <marostegui>	 anomie: It should not break replication again, as we are not running row based. Right now there is one host running the alter tables (db1096) but it is depooled, so...
[16:16:02] <icinga-wm>	 RECOVERY - DPKG on ms-fe1005 is OK: All packages OK
[16:16:24] <marostegui>	 anomie: we also fixed consistency on dewiki and wikidata, so.. :)
[16:16:29] <anomie>	 Ok, thanks
[16:16:43] <marostegui>	 We cannot say it is 100% fixed of course, but it is in a lot better state now
[16:17:31] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3890148 (10faidon) @ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full ema...
[16:17:52] <icinga-wm>	 PROBLEM - Host wtp1034 is DOWN: PING CRITICAL - Packet loss = 100%
[16:17:52] <icinga-wm>	 PROBLEM - Host wtp1040 is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:02] <icinga-wm>	 RECOVERY - Host wtp1034 is UP: PING OK - Packet loss = 0%, RTA = 36.61 ms
[16:18:11] <icinga-wm>	 RECOVERY - Host wtp1040 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms
[16:22:46] <ottomata>	 !log restarting kafka jumbo brokers to apply java.security certpath restrictions
[16:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:41] <anomie>	 !logging Running cleanupUsersWithNoId.php on dewiki and wikidatawiki
[16:26:41] <wm-bot>	 To log a message, use the following format: !log <project> <message>
[16:26:45] <anomie>	 !log Running cleanupUsersWithNoId.php on dewiki and wikidatawiki
[16:26:52] <icinga-wm>	 PROBLEM - Host wtp1031 is DOWN: PING CRITICAL - Packet loss = 100%
[16:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:11] <icinga-wm>	 PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:15] <akosiaris>	 I have definitely scheduled downtimes for the wtp10XX hosts.... what on earth
[16:28:02] <icinga-wm>	 RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms
[16:28:18] <icinga-wm>	 RECOVERY - Host wtp1031 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[16:29:25] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[16:29:51] <akosiaris>	 silly me... seconds
[16:30:04] <akosiaris>	 godog: thumbor known ?
[16:30:08] <wikibugs>	 (03PS1) 10Cmjohnson: adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425
[16:30:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[16:30:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425 (owner: 10Cmjohnson)
[16:30:16] <godog>	 akosiaris: no :(
[16:30:19] <godog>	 I'll check
[16:30:34] <akosiaris>	 Bad response from pybal ?
[16:30:38] <ema>	 looking
[16:30:44] <wikibugs>	 10Operations, 10procurement: Give access to S4 (procurement tasks) to Erika Bjune - https://phabricator.wikimedia.org/T184617#3890177 (10Gehel)
[16:30:47] <wikibugs>	 (03PS2) 10Cmjohnson: adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425
[16:30:52] <akosiaris>	 that's the passive though
[16:31:10] <ema>	 akosiaris: it is, yes. Earlier on today, lvs2003 had the same issue
[16:31:36] <akosiaris>	 yeah and we bounced pybal IIRC
[16:31:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[16:32:11] <akosiaris>	 ah there we go.. that's more like it.. it explains the page 
[16:32:18] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([thumbor1004.eqiad.wmnet, thumbor1002.eqiad.wmnet])
[16:32:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[16:33:03] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3890215 (10Ottomata) Oook, I've set this on all jumbo Kafka brokers.  @bblack anything else?
[16:33:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[16:33:52] <ema>	 http://localhost:9090/alerts and http://localhost:9090/pools were fine on lvs1006 when I checked a couple of minutes ago  
[16:34:18] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[16:34:29] <godog>	 still looking into thumbor btw
[16:34:58] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:35:07] <godog>	 !log bounce thumbor-instances on thumbor1001
[16:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:35] <bblack>	 this isn't the first pybal 500 we've had today
[16:35:38] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[16:35:47] <bblack>	 we must have some bug related to the depooling process here...
[16:36:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[16:36:43] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 172 bytes in 18.767 second response time
[16:36:49] <ema>	 ok I've got the 500 response body from lvs1003
[16:36:49] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused
[16:37:02] <ema>	 > Servers thumbor1004.eqiad.wmnet, thumbor1002.eqiad.wmnet, thumbor1003.eqiad.wmnet are marked down but pooled
[16:37:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[16:37:15] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 80: Connection refused
[16:37:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[16:37:20] <jynus>	 how can I help?
[16:37:21] <godog>	 heh, also I'm pretty sure ms-fe is ok, I was rolling-restart its backends though
[16:37:29] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal
[16:37:34] <volans>	 did we merge anything related recently?
[16:37:47] <jynus>	 so all false positives or just no impact because pool state?
[16:37:50] <bblack>	 the failing "LVS HTTP" check above is real, though
[16:38:05] <ema>	 volans: nope, but I've rebooted all LVSs in eqiad and codfw today
[16:38:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[16:38:10] <bblack>	 the one that says: PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused
[16:38:37] <godog>	 I'm repooling ms-fe1008
[16:38:39] <volans>	 ema: ack, and the etcd connection is ok
[16:38:39] <icinga-wm>	 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:38:45] <volans>	 ?
[16:38:55] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1008.eqiad.wmnet
[16:39:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:15] <bblack>	 volans: that's cache_upload reqs failing due to ms-fe.svc outage
[16:39:21] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1006.eqiad.wmnet
[16:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:44] <volans>	 bblack: yeah, my question mark was for my previous sentence ;)
[16:39:54] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.170 second response time
[16:40:12] <volans>	 I got a connection refused too on 10.2.2.27:443
[16:40:15] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.141 second response time
[16:40:20] <volans>	 but not anymore
[16:40:28] <jynus>	 there is a small spike before a large spike
[16:41:04] <bblack>	 so, if we assume the intended depool plan was sane (didn't depool more than threshold to do reboots or whatever), then there's something wrong on the pybal end here
[16:41:31] <wikibugs>	 10Operations, 10procurement: Give access to S4 (procurement tasks) to Erika Bjune - https://phabricator.wikimedia.org/T184617#3890252 (10RobH) 05Open>03Resolved Added!  @EBjune please be aware that any task with 'Operations Procurement' in the title (in the S4 space) are now visible to you.  Please do NOT...
[16:41:53] <jynus>	 it would be nice to have not only deployments, but also conftool changes on logstash :-)
[16:41:53] <godog>	 bblack: it was, though I thought ms-fe1008 was pooled and it wasn't
[16:42:02] <bblack>	 well either way there's something wrong on the pybal end if it's throwing a 500 I think
[16:42:16] <godog>	 for sure
[16:42:31] <jynus>	 scary
[16:42:41] <bblack>	 maybe we should take a pause on the depools/reboots and figure that part out first
[16:42:52] <ema>	 this is what the 500 from pybal looked like: https://phabricator.wikimedia.org/P6568 
[16:42:54] <bblack>	 but I bet it's related to depool_threshold
[16:43:05] <godog>	 yup I'll hold the rolling restart
[16:43:23] <akosiaris>	 !log wtp* rolling restarts for meltdown finished
[16:43:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:42] <wikibugs>	 10Operations, 10ORES, 10Graphite, 10Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3890264 (10Halfak) OK great.  I'll go +1 :)
[16:44:06] <bblack>	 because I don't think we ever really resolved the depool threshold issue yet, even in the latest versions.  and we may have changed something about it.
[16:44:56] <bblack>	 (the old general-case issue being that if a server going down crosses the threshold mark, some state is lost about that situation without separate concepts of "wants-to-be-depooled" vs "is-depooled")
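(Editor's sketch, not part of the log.) The distinction bblack draws between "wants-to-be-depooled" and "is-depooled" can be illustrated as below. This is a minimal sketch under stated assumptions, not PyBal's actual code: the `Server` class, the `reconcile` function, and the ceiling rounding of the threshold are all hypothetical names and choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    wants_depool: bool = False  # requested state (operator or health check)
    is_pooled: bool = True      # actual state


def reconcile(servers, depool_threshold):
    """Apply requested depools without dropping below the threshold.

    Hypothetical helper: keeps the request (wants_depool) separate from
    the actual state (is_pooled), so a depool refused by the threshold
    is remembered and can be applied on a later pass.
    """
    min_pooled = math.ceil(len(servers) * depool_threshold)

    # Pass 1: repool every server that no longer asks to be depooled.
    for s in servers:
        if not s.wants_depool:
            s.is_pooled = True

    # Pass 2: honour depool requests only while enough servers remain.
    pooled = sum(s.is_pooled for s in servers)
    for s in servers:
        if s.wants_depool and s.is_pooled and pooled - 1 >= min_pooled:
            s.is_pooled = False
            pooled -= 1

    return [s.name for s in servers if s.is_pooled]


servers = [Server("a"), Server("b"), Server("c"), Server("d")]
for s in servers[1:]:
    s.wants_depool = True

first = reconcile(servers, 0.5)   # "d" stays pooled despite its request
servers[1].wants_depool = False   # "b" comes back
second = reconcile(servers, 0.5)  # now "d" can finally be depooled
print(first, second)
```

Because the request is stored separately from the pooled state, the third depool is deferred rather than lost, which is the piece of state the log says goes missing when a server crosses the threshold mark.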
[16:45:09] <wikibugs>	 (03CR) 10Halfak: [C: 031] ""keep_days" is a scary parameter name.  I've confirmed with Filippo that this means "delete_files_not_modified_since_days".  So it looks g" [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) (owner: 10Filippo Giunchedi)
[16:45:33] <wikibugs>	 10Operations, 10Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890280 (10RobH) p:05Triage>03Normal
[16:46:30] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] "Couple of inline comments. Seems fine overall." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402803 (owner: 10ArielGlenn)
[16:46:39] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: CRITICAL - kafka_broker_under_replicated_partitions is 14 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1003
[16:47:04] <bblack>	 as far as the configurations and pool-sets go: swift-fe only has 4x servers per DC, and depool threshold is 0.5
[16:47:17] <bblack>	 so depooling a 3/4 puts us in that state
[16:47:50] <bblack>	 thumbor is the same (4/DC, threshold = 0.5)
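(Editor's sketch, not part of the log.) The arithmetic behind the numbers quoted above (4 servers per DC, depool threshold 0.5) can be written out as follows. The function names are hypothetical, and rounding the minimum up with a ceiling is an assumption; the log does not say how PyBal rounds.

```python
import math


def min_pooled(total_servers: int, depool_threshold: float) -> int:
    """Fewest servers that must stay pooled (assumes ceiling rounding)."""
    return math.ceil(total_servers * depool_threshold)


def can_depool_one(total_servers: int, currently_pooled: int,
                   depool_threshold: float) -> bool:
    """True if one more server can be depooled without crossing the threshold."""
    return currently_pooled - 1 >= min_pooled(total_servers, depool_threshold)


# 4 servers per DC with threshold 0.5: at least 2 must remain pooled.
print(min_pooled(4, 0.5))
# Starting from 3 pooled (one already missing), one depool is still allowed...
print(can_depool_one(4, 3, 0.5))
# ...but a second one would cross the threshold.
print(can_depool_one(4, 2, 0.5))
```

Under these assumptions a pool that silently starts at 3 of 4 has room for only one further depool, so a third server out of four is what puts the pool in the threshold state.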
[16:48:07] <ema>	 it looks like we had 0 servers pooled at a certain point? https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=swift_80
[16:49:01] <wikibugs>	 (03CR) 10Elukey: "Don't have a lot of context about puppetdb to fully review this but code looks sane and pcc is fine! https://puppet-compiler.wmflabs.org/c" [puppet] - 10https://gerrit.wikimedia.org/r/403388 (owner: 10Giuseppe Lavagetto)
[16:49:23] <bblack>	 yeah, it's possible there was some operation sequence issue there and we actually did depool > threshold
[16:49:50] <bblack>	 the secondary issue is: I don't think pybal handles depools>threshold sanely (it never did, but now it does something differently-bad in newer code)
[16:51:39] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[16:51:50] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[16:51:58] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10Tonina_Zhelyazkova_WMDE)
[16:52:40] <icinga-wm>	 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:52:43] <wikibugs>	 (03PS1) 10RobH: Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582)
[16:52:45] <wikibugs>	 (03CR) 10Elukey: "Looks sane from https://puppet-compiler.wmflabs.org/compiler02/9689/nitrogen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey)
[16:52:52] <godog>	 so in terms of sequence I started by assuming all 4x ms-fe machines were pooled, and started depooling 1005, reboot, repool
[16:53:10] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:53:25] <godog>	 then moved onto 1006, depool, reboot
[16:53:36] <godog>	 didn't get to repool before things went sideways
[16:53:59] <elukey>	 can I reboot some analytics hadoop worker nodes? (no pybal involved)
[16:54:47] <ema>	 elukey: yes
[16:54:50] <elukey>	 <3
[16:55:02] <elukey>	 !log reboot analytics1047->50 for kernel updates
[16:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:51] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10Platonides) I guess your manager at WMDE should confirm here that you are indeed a WMDE developer?
[16:56:06] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890319 (10RobH)
[16:56:08] <ema>	 godog: right so you were expecting all 4 machines being pooled, but the graph above shows that only 3 hosts were pooled today 
[16:57:06] <godog>	 ema: yeah, and the three pooled is likely since yesterday when I did another roll-restart, before realizing the kernel wasn't upgraded
[16:57:20] <godog>	 yesterday's roll restart was fine though, a machine at a time
[16:57:25] <bblack>	 right, so 1/4 was already gone and not noticed, then 2x more depools -> threshold
[16:57:39] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: OK - kafka_broker_under_replicated_partitions is 4 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1003
[16:57:55] <wikibugs>	 (03PS1) 10Zoranzoki21: Add throttle rule for Paris University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618)
[16:58:02] <bblack>	 and then pybal seems to at least not handle threshold-limited depools in its HTTP outputs
[16:58:14] <ema>	 yup
[16:58:17] <godog>	 yes though the 2x depools weren't (supposed to be) overlapping
[16:58:19] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890327 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete
[16:58:21] <bblack>	 and then I guess we don't know without more digging what caused the conn-refused on the LVS service
[16:58:38] <bblack>	 it could just be the remaining 1 (or 2?) servers actually couldn't handle the connection load
[16:58:45] <ema>	 the fact that no hosts were pooled for the service I guess
[16:58:49] <wikibugs>	 (03PS2) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618)
[16:59:00] <bblack>	 or it could be that pybal screwed up ipvs state and blocked connections even though it should've kept 2x pooled due to threshold
[16:59:30] <godog>	 yeah I wouldn't be surprised if 1x ms-fe can't handle the load, 2x I'm not sure
[16:59:30] <bblack>	 (or are manual depools supposed to be able to exceed thresholds?)
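The depool-threshold behaviour discussed above can be sketched roughly as follows. This is a minimal illustration, not pybal's actual implementation: the function names and the 0.5 threshold are hypothetical, and the sketch assumes a depool request is simply refused when it would drop the pooled count below the threshold share of all servers.

```python
import math


def can_depool(total, pooled, threshold):
    """Return True if removing one more server still leaves at least
    ceil(total * threshold) servers pooled (hypothetical rule)."""
    min_pooled = math.ceil(total * threshold)
    return pooled - 1 >= min_pooled


def apply_depools(total, pooled, requests, threshold=0.5):
    """Apply `requests` depool requests one at a time, refusing any
    request that would go below the threshold. Returns the final
    number of pooled servers."""
    for _ in range(requests):
        if can_depool(total, pooled, threshold):
            pooled -= 1
    return pooled


# Scenario from the discussion: 4 servers total, 1 already depooled
# (so 3 pooled), followed by 2 more depool requests.
remaining = apply_depools(total=4, pooled=3, requests=2, threshold=0.5)
print(remaining)
```

Under these assumptions the second depool would be refused and two servers would stay pooled, which is the outcome bblack expected if the threshold had been enforced.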
[16:59:57] <godog>	 I'm checking the hosts to see if some got obviously overloaded
[17:00:02] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890343 (10RobH)
[17:00:30] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890357 (10Marostegui) Thanks - will close once it has finished: ```       logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete)       physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Rebuilding)...
[17:02:11] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889014 (10RobH) @EBjune: Please comment with your approval of this expansion of access rights (as @bawolff's manager.)  Thanks!
[17:04:41] <godog>	 I'm looking at ms-fe1* network graphs and indeed looks like at 16:34 pooled servers went to 0 and hosts stopped receiving traffic
[17:07:57] <godog>	 and ms-fe1007 at ~16:25 went to 100% cpu, probably under the swings of traffic moving around
[17:08:21] <volans>	 godog: like the repool didn't actually repool it?
[17:08:37] <_joe_>	 that's not the case
[17:08:44] <_joe_>	 if you go look at pybal's logs
[17:08:52] <godog>	 volans: no, 1007 stayed pooled the whole time afaik
[17:09:25] <ema>	 also we've got an icinga check for that, which didn't trigger (check_pybal_ipvs_diff)
[17:09:38] <_joe_>	 ema: a check for what?
[17:10:08] <ema>	 _joe_: for what I think volans mentioned, a repool that didn't actually repool 
[17:10:18] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10RobH) So the L2 actually won't get you any WMF LDAP flags.  We actually need an NDA on file with WMF legal and a few other things:  [] - have a signed WM...
[17:10:57] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[17:10:58] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:10:59] <_joe_>	 ema: is that the case? I don't see that in the logs
[17:11:38] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[17:11:57] <icinga-wm>	 PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[create_user-replication@netmon2001],Exec[create_user-netbox@netmon2001],Exec[create_user-netbox@localhost],Exec[create_user-prometheus@localhost]
[17:12:02] <ema>	 _joe_: right, that's what I'm saying. If that were the case, check_pybal_ipvs_diff would have alerted
[17:12:24] <_joe_>	 oh ok I didn't understand :)
[17:12:43] <_joe_>	 so from what I see, there was an issue fetching data from swift at 16:34:06
[17:13:14] <godog>	 does it say from what host?
[17:13:22] <_joe_>	 ms-fe1005
[17:13:28] <_joe_>	 is what I'm looking at now
[17:13:31] <_joe_>	 btw
[17:13:41] <_joe_>	 it's still failing
[17:14:25] <_joe_>	 and ms-fe1007
[17:14:25] <godog>	 with what error?
[17:14:49] <_joe_>	 sorry, it's not failing anymore, it stopped at 16:49:50
[17:15:36] <godog>	 yeah that's general recovery I'd say, what was the error from 1005 ?
[17:15:39] <_joe_>	 while ms-fe1007 went down just a bit before (16:33:28) and came back earlier (16:44:26)
[17:15:55] <_joe_>	 from both the error is 
[17:16:07] <_joe_>	 WARN: ms-fe1005.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 5.007 s
[17:16:16] <_joe_>	 ProxyFetch failing and taking more than 5 seconds
[17:17:03] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): tools.iabot is using 1.3T of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183953#3890483 (10bd808)
[17:17:06] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#3890484 (10bd808)
[17:17:09] <wikibugs>	 10Operations, 10Cloud-VPS, 10monitoring, 10cloud-services-team (Kanban): remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3890485 (10bd808)
[17:17:10] <_joe_>	 so you disabled 1006 while ms-fe1005 and 1007 were failing
[17:17:19] <_joe_>	 causing probably an overload of 1008 too
[17:17:22] <wikibugs>	 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3890490 (10bd808)
[17:17:27] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3890492 (10bd808)
[17:17:30] <wikibugs>	 10Operations, 10Cloud-Services, 10hardware-requests, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3890493 (10bd808)
[17:17:32] <godog>	 no, 1008 wasn't pooled so 1007 got overloaded
[17:17:33] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3890494 (10bd808)
[17:18:04] <godog>	 likely 1005 too, so traffic swung too fast among too few machines
[17:18:18] <_joe_>	 godog: looks like it
[17:20:24] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base::resolving: remove useless "else" clause [puppet] - 10https://gerrit.wikimedia.org/r/403439
[17:20:26] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base::resolving: explicitly pass arguments [puppet] - 10https://gerrit.wikimedia.org/r/403440
[17:20:50] <godog>	 ok, so in the root cause there's for sure my mistake of shuffling (de)pools too fast I'd say, and there were three hosts instead of four pooled
[17:21:17] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:21:33] <godog>	 I'll write an incident report about it, maybe there's followup we can do
[17:21:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: explicitly pass arguments [puppet] - 10https://gerrit.wikimedia.org/r/403440 (owner: 10Giuseppe Lavagetto)
[17:22:08] <ema>	 godog: has there actually been at any point 0 hosts pooled? That's what the grafana board suggests, it would be good to find out if it's reliable or not :)
[17:22:34] <godog>	 ema: when all frontends were overloaded I guess there were yeah, but not intentionally
[17:22:57] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:23:01] <_joe_>	 ema: I think 1008 was still pooled
[17:23:07] <_joe_>	 can someone look into pdfrender?
[17:23:14] <_joe_>	 why are they tying in sequence?
[17:23:39] <godog>	 1008 wasn't pooled, if it was then we'd have been fine I think
[17:23:40] <_joe_>	 *dying
[17:24:14] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "Commons should not be included manually. Every throttle rule is applied to Commons, along with Wikidata and other defined projects." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21)
[17:26:47] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time
[17:27:17] <_joe_>	 did someone fix pdfrender or did it recover by itself?
[17:28:07] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890596 (10Tobi_WMDE_SW)
[17:31:17] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time
[17:31:56] <_joe_>	 so on 1004 it recovered by itself
[17:33:37] <icinga-wm>	 RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys.
[17:34:34] <addshore>	 hi twentyafterfour!
[17:34:42] <twentyafterfour>	 Hi!
[17:34:58] <icinga-wm>	 RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys.
[17:35:14] <addshore>	 So, there are some backports for the things spotted yesterday, I have basically only just finished with them though. Daniel is looking through them now :)
[17:35:30] <twentyafterfour>	 ok
[17:35:34] <twentyafterfour>	 I just saw some patches
[17:35:44] <addshore>	 3 for core and possibly 1 for FlaggedRevisions, although the 1 in FlaggedRevisions is also covered by one of the core patches :)
[17:36:49] <twentyafterfour>	 Well there is no rush on my part. It's a couple of hours away from train time but as soon as you're ready I'll deploy group0 so that we can get back on track for group1 this afternoon.
[17:36:58] <wikibugs>	 10Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#3890626 (10ArielGlenn) Ummm.. still wanted? Can we close as impossible or no longer needed?
[17:37:05] <addshore>	 twentyafterfour: yup! okay :)
[17:37:07] <icinga-wm>	 PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:37:26] <addshore>	 will see if we can get them ready for the next swat
[17:37:55] <twentyafterfour>	 addshore: if not I can deploy them with the train
[17:38:29] <addshore>	 twentyafterfour: ack! :)
[17:41:23] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890637 (10Andrew) The terrible way to fix grub on Trusty VMs is:  sudo cumin --force --timeout 120 -o json  "a:All" "lsb_release -si | grep U...
[17:41:53] <volans>	 andrewbogott: it's A:all, not a:All ;)
[17:42:00] <andrewbogott>	 thx
[17:44:33] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890647 (10Tobi_WMDE_SW) >>! In T184620#3890317, @Platonides wrote: > I guess your manager at WMDE should confirm here that you are indeed a WMDE developer?  >>! In...
[17:44:39] <jynus>	 !log upgrade and restart db2086
[17:44:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:20] <andrewbogott>	 !log installing linux-image-generic-lts-xenial on labtestvirt2003
[17:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:02] <jynus>	 !log upgrade and restart db2087
[17:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:42] <wikibugs>	 (03CR) 10ArielGlenn: make role::beta::mediawiki into a profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402803 (owner: 10ArielGlenn)
[18:02:07] <icinga-wm>	 RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[18:05:03] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Promote db2040 to be the codfw-s7 master instead of db2029 [puppet] - 10https://gerrit.wikimedia.org/r/403451 (https://phabricator.wikimedia.org/T176243)
[18:09:52] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890686 (10RStallman-legalteam) @Tonina_Zhelyazkova_WMDE  I'll create the NDA for your electronic signature and route it to your WMDE email address. I'll send an up...
[18:12:08] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[18:13:07] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Promote db2040 as the new codfw-s7 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403453 (https://phabricator.wikimedia.org/T176243)
[18:13:10] <wikibugs>	 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3890704 (10Techyan) @Krenair @herron   Thanks! I guess this information is enough for them.
[18:13:14] <wikibugs>	 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3890706 (10Techyan) 05Open>03Resolved
[18:16:57] <icinga-wm>	 RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:18:32] <addshore>	 twentyafterfour: all of the patches are up in the .16 branch now, I won't bother adding them to swat
[18:18:49] <addshore>	 adding you as a reviewer now, I'll be around again when the train runs :) gimmie a ping :D
[18:18:58] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469)
[18:19:02] <twentyafterfour>	 addshore: thanks
[18:19:25] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3890726 (10jcrespo) a:03jcrespo
[18:19:28] * addshore goes to make food
[18:20:16] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3890733 (10jcrespo)
[18:20:20] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3890734 (10jcrespo)
[18:20:22] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3890730 (10jcrespo) 05Open>03stalled a:05jcrespo>03None
[18:22:25] <addshore>	 twentyafterfour: you also might be able to answer this question for me! One of the patches adds a new log channel called "RevisionStore", will that automatically show up in logstash, or do I need to do something with wmgMonologChannels ?
[18:22:55] <addshore>	 wmgMonologChannels says // Defaults: [ 'udp2log'=>'debug', 'logstash'=>'info', 'kafka'=>false, 'sample'=>false ], and the logging in RevisionStore, so at a guess it will land in logstash, but just wanted to confirm
[18:23:17] <jynus>	 !log upgrade and restart db2040
[18:23:18] <twentyafterfour>	 addshore: I'm not sure
[18:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:38] <twentyafterfour>	 I would guess you're right
[18:28:43] <wikibugs>	 10Operations, 10monitoring: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3890758 (10Volans)
[18:32:57] <icinga-wm>	 RECOVERY - HP RAID on db2060 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[18:34:37] <icinga-wm>	 PROBLEM - Host labtestvirt2001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:37] <icinga-wm>	 RECOVERY - Host labtestvirt2001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[18:36:24] <wikibugs>	 (03PS1) 10Andrew Bogott: labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189)
[18:37:26] <wikibugs>	 (03PS2) 10Rush: labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott)
[18:37:30] <wikibugs>	 (03CR) 10Rush: [C: 031] labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott)
[18:38:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott)
[18:38:07] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890799 (10jcrespo) 05Open>03Resolved a:05Marostegui>03Papaul
[18:38:28] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890802 (10chasemp)
[18:40:22] <andrewbogott>	 !log upgrading labvirt1018 kernel and rebooting
[18:40:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:25] <chasemp>	 !log reboot labtestvirt2002.codfw.wmnet w/ new kernel
[18:45:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:15] <wikibugs>	 (03PS3) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618)
[18:46:38] <icinga-wm>	 PROBLEM - Host labtestvirt2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:48:16] <wikibugs>	 (03PS4) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618)
[18:49:26] <Zoranzoki21>	 zeljkof: I am here
[18:50:00] <greg-g>	 Zoranzoki21: he most likely is not, what are you pinging him regarding?
[18:50:30] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890854 (10EBjune) @RobH I approve of @Bawolff's expansion of access rights for the analytics cluster, thank you!
[18:50:52] <Zoranzoki21>	 greg-g: Because I am finally here, per the rule that a user needs to be on the IRC channel at SWAT time when they have a patch for it
[18:51:25] <Zoranzoki21>	 greg-g: Zeljko never deploys a patch if the owner of the patch is not on IRC at the SWAT time the patch is scheduled for
[18:51:50] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890859 (10chasemp)
[18:52:04] <greg-g>	 zeljko does not do this specific swat window, it's passed his work hours
[18:52:09] <greg-g>	 past*
[18:52:14] <greg-g>	 cc quiddity :P
[18:52:36] <greg-g>	 Zoranzoki21: just stick around, whoever does the swat will ping people with patches
[18:52:40] <quiddity>	 <3  ;)
[18:52:45] <Zoranzoki21>	 ok
[18:59:35] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "labvirts: whitelist the post-meltdown kernel version" [puppet] - 10https://gerrit.wikimedia.org/r/403456 (https://phabricator.wikimedia.org/T184639)
[19:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1900).
[19:00:04] <jouncebot>	 Zoranzoki21: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Revert "labvirts: whitelist the post-meltdown kernel version" [puppet] - 10https://gerrit.wikimedia.org/r/403456 (https://phabricator.wikimedia.org/T184639) (owner: 10Andrew Bogott)
[19:00:37] <jynus>	 !log upgrade and restart db1059
[19:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:45] <jynus>	 the proxies is going to be me, see above
[19:03:38] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[19:04:07] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[19:06:31] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890903 (10chasemp)
[19:09:16] <thcipriani>	 I can SWAT
[19:09:57] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.036 second response time on 10.64.0.117 port 9042
[19:10:25] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21)
[19:11:37] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890936 (10RobH)
[19:12:00] <wikibugs>	 (03Merged) 10jenkins-bot: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21)
[19:12:16] <wikibugs>	 (03CR) 10jenkins-bot: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21)
[19:14:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1059 BBU issues - https://phabricator.wikimedia.org/T184160#3890941 (10jcrespo)
[19:15:23] <thcipriani>	 Zoranzoki21: thanks for the patch, I will go ahead and deploy it everywhere since it is a simple throttle change
[19:15:39] <Zoranzoki21>	 ок
[19:15:43] <Zoranzoki21>	 ok
[19:16:39] <jynus>	 proxies should come back now
[19:16:47] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0
[19:17:07] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0
[19:18:57] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-b valid until 2018-08-17 16:11:09 +0000 (expires in 218 days)
[19:19:52] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "This change has now been assigned its own deployment window (2018-01-11T13:00:00Z/PT1H), so I’ll have one hour to test it on one of the mw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403195 (https://phabricator.wikimedia.org/T181060) (owner: 10Lucas Werkmeister (WMDE))
[19:22:18] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:403432|Add throttle rule for Paris University and sort other by date]] T184618 (duration: 01m 03s)
[19:22:27] <thcipriani>	 ^ Zoranzoki21 live everywhere now
[19:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:31] <stashbot>	 T184618: Request to lift account creation throttling on 2018-01-11 - https://phabricator.wikimedia.org/T184618
[19:22:39] <Zoranzoki21>	 thcipriani: Thank you
[19:22:51] <thcipriani>	 you're welcome :)
[19:32:46] <urandom>	 !log bootstrapping restbase1011-b -- T184100
[19:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:59] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[19:34:18] <wikibugs>	 (03PS1) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464
[19:34:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[19:35:34] <wikibugs>	 (03PS2) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464
[19:35:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[19:36:27] <icinga-wm>	 PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:45:58] <jynus>	 !log upgrade and restart db2047
[19:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:12] <wikibugs>	 (03CR) 10Krinkle: [C: 031] Remove firejail config for now-unused ffmpeg2theora [puppet] - 10https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) (owner: 10Brion VIBBER)
[20:00:04] <jouncebot>	 no_justification: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T2000).
[20:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[20:00:39] <jynus>	 !log upgrade and restart dbstore2001
[20:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:57] <wikibugs>	 (03CR) 10Subramanya Sastry: Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[20:02:30] <wikibugs>	 (03PS1) 10Brion VIBBER: Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470
[20:05:38] <jynus>	 !log upgrade and restart dbstore2002
[20:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:39] <wikibugs>	 (03CR) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[20:09:11] <logmsgbot>	 !log otto@tin Started deploy [eventstreams/deploy@ee854df]: Update eventstreams deploy test to scb2002: T171011
[20:09:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:35] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@ee854df]: Update eventstreams deploy test to scb2002: T171011 (duration: 00m 24s)
[20:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:59] <logmsgbot>	 !log otto@tin Started deploy [eventstreams/deploy@ee854df]: Update eventstreams with newer service-template-node: T171011
[20:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:09] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3453087 (10Imarlier) Is the goal here just to quantify the impact?  Or is there a target connect time/query time that we're tr...
[20:12:55] <wikibugs>	 (03CR) 10Krinkle: [C: 031] Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER)
[20:14:10] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@ee854df]: Update eventstreams with newer service-template-node: T171011 (duration: 04m 11s)
[20:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:48] <wikibugs>	 10Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#3891132 (10Dzahn) 05Open>03declined Yes, I think so, Ariel. and thanks Tim for the details above. As the task creator i'll call it 'declined' but fine with me.
[20:16:32] <addshore>	 twentyafterfour: there is also one on FlaggedRevs (just in case you didn't spot it)
[20:18:13] <twentyafterfour>	 addshore: yeah I think I +2'd that one too
[20:18:21] <twentyafterfour>	 https://gerrit.wikimedia.org/r/#/c/403443/
[20:18:51] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 031] Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[20:20:43] <addshore>	 yup
[20:24:53] <wikibugs>	 (03PS3) 10Dzahn: Replace yubikey nano key with yubikey 4 key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/403095 (owner: 10Aaron Schulz)
[20:25:25] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "verified via file in tin home dir" [puppet] - 10https://gerrit.wikimedia.org/r/403095 (owner: 10Aaron Schulz)
[20:26:22] <mutante>	 AaronSchulz: ^ now i understand what you meant :) i found the file on tin like last time, verified, merged 
[20:27:46] <AaronSchulz>	 heh, thanks
[20:29:16] <mutante>	 AaronSchulz: yw! puppet ran on tin and bast1001. that combo should work already 
[20:40:16] <wikibugs>	 (03CR) 1020after4: [C: 032] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn)
[20:49:37] <logmsgbot>	 !log twentyafterfour@tin Synchronized php-1.31.0-wmf.16: Sync wmf.16 to deploy multiple patches from addshore refs T180749 (duration: 10m 23s)
[20:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:47] <stashbot>	 T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749
[20:53:26] <wikibugs>	 (03CR) 10Krinkle: Switch to YAML configuration for Parsoid on ruthenium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[20:58:36] <wikibugs>	 (03PS1) 1020after4: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475
[20:58:38] <wikibugs>	 (03CR) 1020after4: [C: 032] group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 (owner: 1020after4)
[20:59:27] <addshore>	 twentyafterfour: just realised testwikidatawiki appears in group1 on logstash, when it is actually in group0 i believe....
[20:59:49] <addshore>	 looks like the sync of the patches above made the exceptions disappear :)
[21:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T2100).
[21:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[21:00:19] <twentyafterfour>	 addshore: cool
[21:00:30] <twentyafterfour>	 not sure why it's listed in group1.. hmm
[21:01:30] <hasharAway>	 twentyafterfour: addshore: this european morning there was an uncommitted wikiversions.json on tin
[21:01:40] <hasharAway>	 and I have committed it to a change in gerrit
[21:01:47] <addshore>	 twentyafterfour: i guess that is a logstash dashboard issue
[21:02:01] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER)
[21:02:12] <hasharAway>	 https://gerrit.wikimedia.org/r/#/c/403360/1/wikiversions.json
[21:02:18] <twentyafterfour>	 hasharAway: that was because of the train getting held up before going to group 1
[21:02:30] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 (owner: 1020after4)
[21:02:31] <hasharAway>	 for logstash, maybe the list of wikis is hardcoded manually
[21:02:38] <twentyafterfour>	 I mean group0
[21:02:39] <addshore>	 hasharAway: indeed
[21:02:46] <wikibugs>	 (CR) jenkins-bot: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403475 (owner: 20after4)
[21:02:59] <awight>	 Nothing for ORES
[21:04:36] <subbu>	 nothing for parsoid
[21:05:36] <wikibugs>	 (PS3) Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - https://gerrit.wikimedia.org/r/403464
[21:05:47] <wikibugs>	 (PS2) 20after4: load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: ArielGlenn)
[21:05:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[21:06:04] <wikibugs>	 (CR) Arlolra: Switch to YAML configuration for Parsoid on ruthenium (1 comment) [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[21:06:18] <wikibugs>	 (CR) 20after4: [V: 2 C: 2] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: ArielGlenn)
[21:09:26] <logmsgbot>	 !log twentyafterfour@tin Started scap: group0 to 1.31.0-wmf.16 refs T180749
[21:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:39] <stashbot>	 T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749
[21:11:26] <wikibugs>	 (CR) jenkins-bot: load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: ArielGlenn)
[21:13:49] <wikibugs>	 Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3891351 (Krinkle)
[21:30:52] <wikibugs>	 (CR) Dzahn: "@Ladsgroup i have heard you have done work on standardizing error page style before (for dumps?) as part of a general update" [puppet] - https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: Phantom42)
[21:30:53] <wikibugs>	 Operations, Puppet: Upgrade puppetDB to version 3.2 or newer - https://phabricator.wikimedia.org/T177253#3891421 (herron) So we’ll need to select a puppetdb version and package to proceed.  Puppetdb 4.4 looks like the version we should target as according to puppetlabs docs it’s the newest release still...
[21:31:45] <wikibugs>	 (PS4) Dzahn: rename planet_server to just planet [puppet] - https://gerrit.wikimedia.org/r/393710
[21:32:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] rename planet_server to just planet [puppet] - https://gerrit.wikimedia.org/r/393710 (owner: Dzahn)
[21:35:47] <wikibugs>	 Operations, Goal: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) - https://phabricator.wikimedia.org/T65899#3891449 (ArielGlenn)
[21:35:50] <wikibugs>	 Operations, Goal, HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#3891448 (ArielGlenn)
[21:35:54] <wikibugs>	 Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3891443 (ArielGlenn) stalled>declined Officially declining, move to php7 has been approved, see  T176370  I've been working on a dump instance in...
[21:37:58] <icinga-wm>	 PROBLEM - Host mw1271 is DOWN: PING CRITICAL - Packet loss = 100%
[21:38:27] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 50.28, 34.67, 22.00
[21:38:35] <twentyafterfour>	 uhm
[21:38:50] <addshore>	 *looks*
[21:39:05] <Reedy>	 It's dead, jim
[21:39:08] <twentyafterfour>	 current deploy is at 88%
[21:39:15] <twentyafterfour>	 with 1 node failure
[21:39:25] * twentyafterfour wonders if I need to roll back real quick
[21:40:11] <twentyafterfour>	 there isn't anything of note in fatalmonitor that I can see
[21:40:13] <Seddon>	 Reedy! What did you do!
[21:40:15] <Reedy>	 I wouldn't say so yet
[21:40:15] <Seddon>	 :P
[21:40:24] <Reedy>	 Seddon: Fixed it till it was broken
[21:40:31] <Seddon>	 Reedy: Of course :P
[21:42:27] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 49.54, 37.77, 26.00
[21:42:43] <twentyafterfour>	 hmm at least it's not getting much worse?
[21:42:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.64, 29.20, 24.15
[21:42:54] <twentyafterfour>	 uh oh
[21:42:57] <twentyafterfour>	 that's a different one
[21:42:59] <addshore>	 hmmm, is it these app servers?
[21:43:01] <addshore>	 that's 2
[21:43:07] <Reedy>	   927 www-data  20   0 27.601g 2.901g 114480 S  1513  4.6   6621:55 hhvm
[21:43:07] <Reedy>	   908 nutcrac+  20   0   69228  46768   2156 S  24.2  0.1 139:28.64 nutcracker
[21:43:21] <twentyafterfour>	 wth
[21:43:34] <twentyafterfour>	 the scap errors are from mw1271.eqiad.wmnet
[21:43:36] <Reedy>	 A big request parsing stuff?
[21:43:56] <Reedy>	 load average: 27.65, 34.25, 25.89
[21:43:58] <Reedy>	 It's coming down
[21:44:06] <Reedy>	 host isn't unresponsive
[21:44:27] <Reedy>	 load average: 21.55, 32.18, 25.47
[21:45:06] <twentyafterfour>	 hmm I see a lot of stuff in logstash that just looks like a big long list of usernames, split up over multiple log entries
[21:45:39] <Reedy>	 load average: 17.58, 28.58, 24.73
[21:45:47] <Reedy>	 There's reasons we shouldn't let the users have nice things
[21:45:55] <addshore>	 I saw a bunch of stuff "Pool error on {key}: {error}"
[21:46:37] <Reedy>	 load average under 15
[21:46:40] * Reedy kicks icinga-wm
[21:47:27] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1201 is OK: OK - load average: 12.43, 23.73, 23.41
[21:47:56] <logmsgbot>	 !log twentyafterfour@tin Finished scap: group0 to 1.31.0-wmf.16 refs T180749 (duration: 38m 29s)
[21:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:10] <stashbot>	 T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749
[21:48:31] <twentyafterfour>	 note: this wasn't even group1 yet :-/
[21:48:47] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 18.79, 23.84, 23.63
[21:48:48] <twentyafterfour>	 still I guess logstash looks ok, I don't know what caused the api servers to get hit
[21:49:03] <twentyafterfour>	 coincidence I suppose
[21:49:35] <wikibugs>	 (PS1) 20after4: group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403536
[21:49:37] <wikibugs>	 (CR) 20after4: [C: 2] group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403536 (owner: 20after4)
[21:51:45] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403536 (owner: 20after4)
[21:51:57] <wikibugs>	 (CR) jenkins-bot: group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403536 (owner: 20after4)
[21:53:27] <logmsgbot>	 !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.16
[21:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:19] <wikibugs>	 (PS14) Paladox: Update gerrit login display [puppet] - https://gerrit.wikimedia.org/r/402665
[21:54:30] <logmsgbot>	 !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.16 (duration: 01m 02s)
[21:54:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:00] <addshore>	 twentyafterfour: is that .16 on group1 then? :)
[21:57:07] <twentyafterfour>	 !log group1 looks stable. This concludes the MediaWiki train for today.
[21:57:09] <twentyafterfour>	 addshore: yep
[21:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:33] <addshore>	 twentyafterfour: awesome, yes, I see nothing that alarms me :)
[21:57:52] <twentyafterfour>	 ok cool
[21:59:45] <addshore>	 hmm, twentyafterfour I do see a couple of things now actually
[22:00:01] <addshore>	 https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval%3A('%24%24hashKey'%3A'object%3A1287'%2Cdisplay%3A'10%20seconds'%2Cpause%3A!f%2Csection%3A1%2Cvalue%3A10000)%2Ctime%3A(from%3Anow-15m%2Cmode%3Aquick%2Cto%3Anow))
[22:00:13] <addshore>	 bah, that's the wrong link
[22:00:21] <addshore>	 https://logstash.wikimedia.org/goto/74dfb80b01ae92a809b22eb9b430272a
[22:01:12] <addshore>	 however it doesn't immediately look critical 
[22:02:42] <addshore>	 I'll look at logstash again later or tomorrow and see if anything needs to happen. Off for now
[22:08:33] <twentyafterfour>	 thanks addshore
[22:52:07] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3891625 (Andrew) OK -- in System Setup:Device Settings I see one nic with four ports:    Integrated NIC 1 Port 1: Intel(R) Ethernet 10G 4P X520/I350 rNDC -                24:6E:96:8D...
[22:58:54] <wikibugs>	 Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3891632 (Krinkle)
[22:59:38] <wikibugs>	 Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3891635 (Krinkle)
[22:59:47] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, TechCom-RFC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3891637 (Krinkle)
[23:00:10] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, TechCom-RFC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (Krinkle)
[23:10:03] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#3891691 (Krenair) Created a new system, ran into the problem that https://gerrit.wikimedia.org/r/#/c/403326/ fixes
[23:16:13] <wikibugs>	 Operations, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3891713 (kaldari) Pinging @MoritzMuehlenhoff. Please see my most recent comment above. Thanks!
[23:26:17] <icinga-wm>	 PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:27:22] <wikibugs>	 (PS5) Dzahn: rename planet_server to just planet [puppet] - https://gerrit.wikimedia.org/r/393710
[23:28:19] <wikibugs>	 Operations, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3891728 (JoKalliauer) Pinging @kaldari . [[ https://commons.wikimedia.org/wiki/File:O_Canada_Lilypond.svg | File:O_Canada_Lilypond.svg ]] h...
[23:43:39] <wikibugs>	 (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/9691/" [puppet] - https://gerrit.wikimedia.org/r/393710 (owner: Dzahn)
[23:46:58] <wikibugs>	 (PS1) Krinkle: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ [puppet] - https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903)
[23:49:18] <wikibugs>	 (CR) Krinkle: "Need to decide where to split the thread." [puppet] - https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[23:50:29] <wikibugs>	 (PS2) Krinkle: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ [puppet] - https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903)
[23:54:14] <wikibugs>	 Operations, Commons, Wikimedia-SVG-rendering, media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3891777 (kaldari)
[23:56:17] <icinga-wm>	 RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:56:38] <wikibugs>	 (PS4) Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114)