[01:05:32] <wikibugs>	 (PS45) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[01:29:29] <wikibugs>	 (PS1) Subramanya Sastry: Update linter stats for commonswiki less frequently [mediawiki-config] - https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280)
[02:11:10] <wikibugs>	 Operations, DBA, Release-Engineering-Team, cloud-services-team, Chinese-Sites: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3899755 (Shizhao)
[02:31:04] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 07m 50s)
[02:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:52] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 762.29 seconds
[03:55:52] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.11 seconds
[04:12:40] <wikibugs>	 Operations, DBA, Release-Engineering-Team, cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3899801 (Liuxinyu970226) @Shizhao I'm sorry but Jcrespo hasn't decision about zhwiki here
[05:24:26] <wikibugs>	 (CR) Legoktm: [C: 1] Update linter stats for commonswiki less frequently [mediawiki-config] - https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: Subramanya Sastry)
[06:13:08] <marostegui>	 !log Deploy schema change on db1070 (s5 master) - T174569
[06:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:21] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:15:49] <wikibugs>	 Operations, ops-codfw, DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899829 (Marostegui)
[06:16:16] <wikibugs>	 Operations, ops-codfw, DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899841 (Marostegui) p:Triage>Normal
[06:16:45] <wikibugs>	 Operations, ops-codfw, DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3899843 (Marostegui) p:Triage>Normal
[06:29:33] <wikibugs>	 Operations, ops-codfw, DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899844 (Papaul) @Marostegui sorry but we don't have any used BBU from a decommissioned host that we can use . (we have no decommissioned HP servers)
[06:34:36] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3899845 (Papaul) p:Triage>Normal
[06:35:34] <wikibugs>	 Operations, ops-codfw, DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899858 (Marostegui) Thanks @Papaul - I have checked the hosts that will soon be decommissioned and none of them are HP. @RobH any ideas on what can we do about this?
[06:36:55] <wikibugs>	 Operations, ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3899860 (Papaul) p:Triage>Normal
[06:37:29] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807)
[06:38:15] <wikibugs>	 (PS2) Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807)
[06:40:47] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:42:21] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:42:36] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:44:13] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 to fix data drifts on the archive table - T162807 (duration: 01m 13s)
[06:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:26] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:52:15] <wikibugs>	 (PS1) Marostegui: db1065.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/404237 (https://phabricator.wikimedia.org/T148507)
[06:52:17] <marostegui>	 !log Upgrade MariaDB on db1065
[06:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:34] <wikibugs>	 (CR) Marostegui: [C: 2] db1065.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/404237 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[07:00:21] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2110620
[07:01:36] <wikibugs>	 (CR) Gergő Tisza: [C: 1] Stop rewriting m.wikipedia.org and zero.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/404158 (https://phabricator.wikimedia.org/T69015) (owner: Mholloway)
[07:11:52] <marostegui>	 !log Deploy schema change on silver (labswiki) and labtestweb2001 (labtestwiki) - T174569
[07:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:05] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[07:22:38] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397)
[07:23:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:24:23] <wikibugs>	 (PS2) Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397)
[07:24:59] <wikibugs>	 (PS3) Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397)
[07:27:11] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:28:48] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:28:58] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:30:29] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Replace db1063 with db1087 as vslow in s8 (duration: 01m 12s)
[07:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:00] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[07:40:40] <wikibugs>	 (PS5) Giuseppe Lavagetto: wmflib: simplify the role() function, convert to the new API [puppet] - https://gerrit.wikimedia.org/r/402345
[07:41:54] <_joe_>	 !log disabling puppet in all of production before merging https://gerrit.wikimedia.org/r/402345
[07:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] wmflib: simplify the role() function, convert to the new API [puppet] - https://gerrit.wikimedia.org/r/402345 (owner: Giuseppe Lavagetto)
[07:50:06] <_joe_>	 !log forcing puppet run on the puppetmasters to force pluginsync for function change
[07:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:55] <_joe_>	 !log reenabling puppet on all systems where it was previously enabled, after various testing
[07:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:24] <wikibugs>	 (PS2) Giuseppe Lavagetto: hiera: port nuyaml to hiera 3 [puppet] - https://gerrit.wikimedia.org/r/402346
[08:11:39] <ema>	 !log lvs400[56]: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267
[08:11:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:52] <stashbot>	 T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[08:15:35] <moritzm>	 !log rebooting terbium for kernel security update
[08:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] <moritzm>	 terbium is back up
[08:20:21] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0
[08:22:56] <moritzm>	 !log rebooting bast1001 for kernel security update
[08:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:01] <moritzm>	 bast1001 is back up
[08:29:03] <marostegui>	 ºo/
[08:29:07] <marostegui>	 \o/
[08:29:32] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1945 bytes in 0.099 second response time
[08:38:52] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.40 seconds
[08:39:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.17 seconds
[08:40:47] <jynus>	 that looks bad
[08:41:13] <marostegui>	 it is the BBU
[08:41:25] <marostegui>	 and the raid policy is in WT
[08:41:32] <marostegui>	 https://phabricator.wikimedia.org/T184888
[08:42:27] <moritzm>	 !log reboot wezen for kernel security update
[08:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:55] <jynus>	 I want to RESET SLAVE ALL on db1031
[08:44:05] <jynus>	 ok with that?
[08:44:30] <marostegui>	 sounds good
[08:44:56] <jynus>	 !log disconnecting codfw -> eqiad replication for x1
[08:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:37] <jynus>	 is it just me or has the load on x1 increased, too?
[08:47:51] <jynus>	 https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&var-port=9104
[08:49:18] <marostegui>	 but just now, no?
[08:49:21] <marostegui>	 (per that graph)
[08:50:00] <jynus>	 well, when it started lagging
[08:50:15] <jynus>	 and probably why it did
[08:51:57] <marostegui>	 the spike looks gone now
[08:52:05] <marostegui>	 (the write spike I mean)
[08:52:12] <jynus>	 cool, maybe slave catches up
[08:59:31] <_joe_>	 anyone looking at the wikidata dispatch lag?
[08:59:51] <jynus>	 no, we were busy with the hw issue
[09:00:05] <_joe_>	 yeah not you, I meant other opsens 
[09:00:47] <jynus>	 pattern not found
[09:00:54] <marostegui>	 jynus: we can force db2033 to be WB, but I would prefer not to do so
[09:00:57] <jynus>	 that looks like a configuration/content problem
[09:00:57] <_joe_>	 yeah scratch that message
[09:01:17] <_joe_>	 jynus: nope, it's not. It's lagging, the alert is misleading
[09:01:32] <jynus>	 marostegui: let's not do it if it doesn't create user problems
[09:01:45] <moritzm>	 the wikidata.org lag is probably due to the terbium reboot, it probably needs to catch up
[09:01:53] <marostegui>	 Agreed, let's see if r*bh has some ideas about how we can replace the BBU
[09:01:54] <jynus>	 ah, that would explain it
[09:02:18] <_joe_>	 moritzm: I'd expect it to have caught up after more than half an hour
[09:03:23] <jynus>	 did terbium boot back up, how long ago?
[09:03:27] <moritzm>	 not sure, Amir mentioned that less than 10 mins of non-availability of terbium should be fine and the reboot took maybe two
[09:03:50] <moritzm>	 jynus: yeah, it came back up at 8:18 UTC
[09:03:56] <moritzm>	 and took maybe two mins
[09:04:25] <jynus>	 then the check is probably not well puppetized/depending on something stateful
[09:05:14] <jynus>	 not the check, probably the dispatch itself
[09:05:20] <jynus>	 https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=5&fullscreen&orgId=1
[09:05:53] <jynus>	 if that is the load of terbium probably not very reliable
[09:06:09] <jynus>	 https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1 is worrying
[09:06:29] <jynus>	 Amir1 ^
[09:06:44] <Amir1>	 I just got here
[09:06:56] <Amir1>	 sorry let me take a look
[09:07:27] <elukey>	 !log reboot aqs1004 for kernel updates
[09:07:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:19] <Amir1>	 jynus: It's definitely related to terbium restart
[09:08:46] <Amir1>	 I think we have a similar problem to last time, the way to fix it is to remove the lock in redis
[09:08:52] <jynus>	 there are 4 crons running
[09:08:58] <jynus>	 maybe blocked?
[09:09:26] <_joe_>	 Amir1: this is ridiculous and needs to be fixed as a UBN ticket
[09:09:34] <_joe_>	 Amir1: what lock on redis?
[09:09:43] <_joe_>	 and yes, the lag is indeed very high
[09:09:54] <hashar>	 !log upgrading Zuul on contint1001 | https://gerrit.wikimedia.org/r/#/c/356181/
[09:10:00] <Amir1>	 let me fix it ASAP
[09:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:09] <_joe_>	 Amir1: which lock on redis?
[09:10:16] <jynus>	 I am going to guess it is the thing that used to be on the wikidata masters and that I asked to be removed
[09:10:23] <_joe_>	 we're creating endless locks?
[09:10:38] <Amir1>	 jynus: exactly 
[09:10:54] <Amir1>	 _joe_: no, they have expiry but I think it's 2 hours
[09:11:02] <_joe_>	 jynus: the problem is not the storage medium, it's the idea of an endless lock
[09:11:03] <Amir1>	 let me check config
[09:11:15] <hashar>	 !log upgrading Zuul on contint2001 (zuul-merger) | https://gerrit.wikimedia.org/r/#/c/356181/
[09:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:29] <_joe_>	 Amir1: the real problem here, imho, is that we're still running this thing as concurrent crons
[09:13:36] <_joe_>	 instead of as a proper service
[09:14:10] <Amir1>	 _joe_: would love to help getting it done
[09:14:16] <_joe_>	 if you had a proper service, with shutdown handlers, it would've removed the lock and stopped on a reboot
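The point about shutdown handlers can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `TtlLockStore` is an in-memory stand-in for the real lock manager, `DispatchWorker` and the `client-wiki` key are hypothetical names, and the two-hour TTL is only the figure guessed at in the channel. It contrasts a run that is killed without cleanup (the lock blocks other workers until it expires) with a run whose SIGTERM handler releases the lock first.

```python
import signal
import time


class TtlLockStore:
    """In-memory stand-in (assumption) for a lock manager whose locks expire."""

    def __init__(self):
        self._expiry = {}  # lock key -> absolute expiry time in seconds

    def acquire(self, key, ttl, now=None):
        now = time.time() if now is None else now
        if self._expiry.get(key, 0) > now:
            return False  # an unexpired lock is still held by someone else
        self._expiry[key] = now + ttl
        return True

    def release(self, key):
        self._expiry.pop(key, None)


class DispatchWorker:
    """Hypothetical long-running worker that gives its lock back on shutdown."""

    def __init__(self, store, key, ttl):
        self.store, self.key, self.ttl = store, key, ttl
        self.running = False

    def start(self, now=None):
        self.running = self.store.acquire(self.key, self.ttl, now)
        return self.running

    def handle_shutdown(self, signum=None, frame=None):
        # Invoked on SIGTERM (for example during a reboot): release at once.
        if self.running:
            self.store.release(self.key)
        self.running = False

    def install_signal_handlers(self):
        signal.signal(signal.SIGTERM, self.handle_shutdown)


TWO_HOURS = 2 * 60 * 60

# Cron-style run killed by a reboot: nothing releases the lock.
store = TtlLockStore()
DispatchWorker(store, "client-wiki", TWO_HOURS).start(now=0)
blocked = not DispatchWorker(store, "client-wiki", TWO_HOURS).start(now=600)

# Service-style run: the shutdown handler releases the lock before exit.
store2 = TtlLockStore()
worker = DispatchWorker(store2, "client-wiki", TWO_HOURS)
worker.start(now=0)
worker.handle_shutdown(signal.SIGTERM, None)
resumed = DispatchWorker(store2, "client-wiki", TWO_HOURS).start(now=600)

print(blocked, resumed)
```

In the first scenario a new worker ten minutes after the reboot is still locked out; in the second it can resume immediately. A real service would call `install_signal_handlers()` at startup rather than invoking the handler by hand.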
[09:14:29] <jynus>	 ok, let's focus on the current issue
[09:14:42] <jynus>	 is there something we can do to fix it now?
[09:14:57] <Amir1>	 the only thing for me is the place it should go, otherwise I think it's not that hard to implement 
[09:15:00] <_joe_>	 jynus: I think the current issue is known and Amir1 has a solution at hand
[09:15:10] <jynus>	 do you?
[09:15:24] <Amir1>	 jynus: two ways: 1- we do nothing until the locks expire
[09:15:37] <Amir1>	 2- clean the locks from redis lock manager
[09:15:52] <jynus>	 can we do #2?
[09:15:54] <Amir1>	 I'm looking in more depth
[09:16:06] <Amir1>	 jynus: yup, done it before
[09:16:10] <Amir1>	 addshore did
[09:19:09] <Amir1>	 I also can do it, just need to find the related mediawiki code to do in eval.php
[09:20:21] <jynus>	 something seems to be being executed looking at the graphs
[09:20:50] <Amir1>	 yeah but since most of them are locked, they think there is nothing to do
[09:24:07] <Amir1>	 _joe_: can you flush out all keys in redis that start with 'Wikibase.wikidatawiki.dispatchChanges'?
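A prefix-based cleanup like the one requested here could look roughly like the following Python sketch. It is written against any client object that offers `scan_iter(match=...)` and `delete(*keys)`, which is the shape redis-py exposes; the `FakeRedis` class, the key suffixes, and the helper name `delete_keys_with_prefix` are all hypothetical and exist only so the example runs without a server. Whether the real lock keys actually carry this prefix verbatim is not confirmed by the log.

```python
import fnmatch


class FakeRedis:
    """Tiny in-memory stand-in exposing the two methods the sketch relies on."""

    def __init__(self, data):
        self.data = dict(data)

    def scan_iter(self, match="*"):
        # Snapshot the keys so callers can delete while iterating.
        for key in list(self.data):
            if fnmatch.fnmatchcase(key, match):
                yield key

    def delete(self, *keys):
        return sum(1 for k in keys if self.data.pop(k, None) is not None)


def delete_keys_with_prefix(client, prefix):
    """Delete every key starting with `prefix`; return how many were removed."""
    removed = 0
    for key in client.scan_iter(match=prefix + "*"):
        removed += client.delete(key)
    return removed


PREFIX = "Wikibase.wikidatawiki.dispatchChanges"

# Hypothetical key names; only the prefix comes from the conversation.
client = FakeRedis({
    PREFIX + ".enwiki": "1",
    PREFIX + ".dewiki": "1",
    "unrelated:key": "x",
})
removed = delete_keys_with_prefix(client, PREFIX)
print(removed, sorted(client.data))
```

Iterating with an incremental scan rather than fetching all keys in one call is the usual choice on a busy server, since it avoids one long blocking operation.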
[09:24:14] <addshore>	 *looks up*
[09:24:35] <Amir1>	 It seems the locks are expiring
[09:24:43] <_joe_>	 Amir1: I think you can, yes, but since I do see some changes being dispatched, I'd first need to kill the applications
[09:24:44] <addshore>	 Yeh, I have some pending changes to change the lock ttl
[09:24:51] <addshore>	 And change the lock manager
[09:25:11] <addshore>	 Amir1: I need to get 3 or so patches merged in wikibase first
[09:25:15] <_joe_>	 addshore: which redis servers does this connect to?
[09:25:24] <addshore>	 *opens laptop*
[09:25:41] <addshore>	 The lack manager is defined in filebackend.php
[09:25:44] <addshore>	 *lock
[09:25:46] <_joe_>	 because it seems it doesn't go via nutcracker
[09:26:04] <_joe_>	 oh you mean the traditional redis lock manager in mediawiki
[09:26:08] <Amir1>	 https://github.com/wikimedia/operations-mediawiki-config/blob/1290e7fcffd6dd8834ee1a85a378aa3646a88e6a/wmf-config/filebackend.php
[09:26:10] <_joe_>	 if only I knew :P
[09:26:11] <Amir1>	 yup
[09:26:36] <Amir1>	 we *abuse* file lock manager for dispatching
[09:26:53] <addshore>	 I have patches up to switch to a different lock manager with a different ttl
[09:27:12] <addshore>	 not sure if we have a ticket for it as I was just doing it because I thought it should be done
[09:27:26] <Amir1>	 _joe_: it seems it's getting back up: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1
[09:27:43] <jynus>	 honestly, at this point I would leave it as it is
[09:27:46] <Amir1>	 Should we let it recover naturally or flush out every thing?
[09:27:52] <_joe_>	 yeah me too
[09:27:54] <jynus>	 it could create an overload on the databases?
[09:28:05] <_joe_>	 I'm seeing quite a few threads able to work
[09:28:08] <jynus>	 if it starts applying changes too fast
[09:28:13] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1939 bytes in 0.072 second response time
[09:28:20] <addshore>	 https://gerrit.wikimedia.org/r/#/c/395967 https://gerrit.wikimedia.org/r/#/c/395969 https://gerrit.wikimedia.org/r/397535 https://gerrit.wikimedia.org/r/397536
[09:28:36] <_joe_>	 addshore: I'd be more interested in transforming this into a proper production service
[09:28:40] <jynus>	 the other issue is the alert- something went wrong on the check
[09:28:48] <_joe_>	 instead of a list of ever-overlapping cron scripts
[09:28:51] <Amir1>	 jynus: I highly doubt that; at this time there are so few changes that even if they were held up they wouldn't create much of an issue
[09:28:56] <_joe_>	 jynus: what went wrong?
[09:29:01] <jynus>	 maybe it checks redis and it couldn't get the info if nothing is happening?
[09:29:06] <addshore>	 _joe_: me too, this was just to avoid us having these locks occasionally last for 2 hours and confuse everyone :)
[09:29:14] <_joe_>	 eheh ok
[09:29:15] <jynus>	 _joe_: the graphs kept working
[09:29:23] <_joe_>	 jynus: what didn't work, sorry?
[09:29:25] <jynus>	 showing the right dispatch
[09:29:35] <jynus>	 but the check for wikidata dispatch didn't say "high lag"
[09:29:37] <_joe_>	 the "pattern not found" is what the alert, written like it is now
[09:29:45] <jynus>	 it said that
[09:29:49] <_joe_>	 will show you when the lag is greater than 300 seconds
[09:29:57] <_joe_>	 it's by design, it's not broken
[09:30:02] <jynus>	 really?
[09:30:03] <_joe_>	 one can say it's a poor design
[09:30:17] <_joe_>	 yes, it uses check_http under the hood
[09:30:24] <jynus>	 I don't get it
[09:30:33] <jynus>	 what pattern are we talking about?
[09:31:01] <_joe_>	 jynus: it requests https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics
[09:31:18] <jynus>	 ok, then the problem is the response there
[09:31:18] <_joe_>	 and then parses it with a regex to see if the lag is 300 seconds or less
[09:31:22] <jynus>	 not the check
[09:31:33] <_joe_>	 if it's not, say it's 400, it says "pattern not found"
[09:31:43] <_joe_>	 no, the problem is the check that is abysmal
[09:31:53] <jynus>	 the response should say something more explicit
[09:32:01] <_joe_>	 --ereg '"median":[^}]*"lag":([\
[09:32:03] <_joe_>	 1-2]?[0-9]?[0-9]|300),'
[09:32:08] <_joe_>	 this is what it does
[09:32:11] <_joe_>	 sigh :P
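Why the alert reads "pattern not found" instead of "high lag" can be reproduced with a short Python sketch. The pattern is the one pasted above with the wrapped line joined, which is an assumption about how the paste was meant to read. The payload produced by `sample_body` is a hypothetical, simplified stand-in for the API response, and `explicit_check` is only one possible alternative, not the check that was deployed.

```python
import json
import re

# Pattern from the messages above, with the line continuation joined.
PATTERN = re.compile(r'"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|300),')


def regex_check(body):
    """Match-or-fail check: it can only say OK or 'pattern not found'."""
    return "OK" if PATTERN.search(body) else "CRITICAL: pattern not found"


def explicit_check(body, threshold=300):
    """Alternative sketch: parse the JSON and report the measured lag."""
    lag = json.loads(body)["median"]["lag"]
    state = "OK" if lag <= threshold else "CRITICAL"
    return f"{state}: median dispatch lag is {lag}s (threshold {threshold}s)"


def sample_body(lag):
    # Hypothetical, simplified payload; compact separators matter to the regex.
    return json.dumps({"median": {"id": 1, "lag": lag, "pending": 3}},
                      separators=(",", ":"))


for lag in (120, 300, 301, 400):
    print(lag, "|", regex_check(sample_body(lag)), "|", explicit_check(sample_body(lag)))
```

The regex only matches lag values from 0 to 300, so any higher value makes the match fail and the plugin can report nothing more specific than a missing pattern. Parsing the value keeps the same threshold but puts the actual number in the alert text.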
[09:32:12] <jynus>	 in fact, there is not a reason why it shouldn't show the dispatch lag
[09:32:22] <jynus>	 grafana does
[09:32:34] <jynus>	 we could even set an alert based on grafana
[09:32:39] <jynus>	 you know what I mean?
[09:32:53] <addshore>	 hmm, I thought it was based on grafana...
[09:32:55] <elukey>	 !log reboot kafka2001 for kernel updates
[09:33:05] <addshore>	 at least, when I wrote the original check it was based on graphite data
[09:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:09] <Amir1>	 jynus: you can make an alert on grafana
[09:33:12] <jynus>	 check_http may be doing the right thing, but the response/check doesn't
[09:33:14] <Amir1>	 that seems easier
[09:33:31] <jynus>	 so, as a follow up
[09:33:41] <jynus>	 and this is only my opinion
[09:33:54] <jynus>	 I would like to see 2 tickets, one about app architecture
[09:34:03] <addshore>	 oh wait, or was that for wdqs lag.. hmmm
[09:34:08] <jynus>	 and get some ops involved
[09:34:16] <jynus>	 and the other about the check
[09:34:26] <addshore>	 jynus: I believe there are already tickets about the dispatching architecture / overhaul, and there have been for some time
[09:34:32] <jynus>	 but this is just the opinion of someone that has very little involvement
[09:34:48] <wikibugs>	 (PS1) Filippo Giunchedi: restbase: reprovision restbase1017 [puppet] - https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100)
[09:35:09] <jynus>	 well, I complained when it was on the wikidata masters, but I am not 100% sure this is better
[09:35:15] <jynus>	 also, it needs some redundancy
[09:35:36] <addshore>	 https://phabricator.wikimedia.org/T178652 is the ticket regarding the current lock manager timeout
[09:35:42] <jynus>	 terbium is a SPOF- we use it for obvious maintenance
[09:36:13] <jynus>	 and we can take it down and set up another server, but that is meant for things like generating special pages
[09:36:31] <jynus>	 where time is not a huge issue
[09:36:52] <addshore>	 Here is an epic covering dispatching in general https://phabricator.wikimedia.org/T108944
[09:37:33] <addshore>	 Parent ticket about using the jobqueue instead of the current system https://phabricator.wikimedia.org/T48643
[09:37:40] <jynus>	 addshore: I do not think that covers everything I said, especially architecture and redundancy
[09:38:02] <jynus>	 those are software concerns, mine are mostly related to architecture and hardware
[09:38:11] <wikibugs>	 (PS4) Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - https://gerrit.wikimedia.org/r/398311
[09:38:31] <addshore>	 so architecture not as in software architecture?
[09:39:19] <jynus>	 e.g. where should we run those from? how to set up a standalone service? how to provide high availability? how to improve alerts?
[09:40:15] <addshore>	 well, if it uses the jobqueue most of that is redundant? it would need to be a standalone service, it would run from the job runners, alerts, sure, could probably be improved, but there is not much to alert about other than high lag
[09:40:51] <marostegui>	 You guys think I can deploy a schema change on wikidata on codfw (not active dc)? or shall I wait?
[09:41:37] <Amir1>	 Everything is back to normal AFAIK
[09:42:31] <Amir1>	 I like using jobqueue as it reduces the hassle of dispatching for third parties (=my localhost)
[09:43:16] <wikibugs>	 (CR) Mobrovac: [C: 1] restbase: reprovision restbase1017 [puppet] - https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:44:04] <marostegui>	 Amir1: Thanks - I will deploy then
[09:45:14] <marostegui>	 !log Deploy schema change on s8 codfw master (db2045) with replication (this will generate lag on s8 codfw) - T174569
[09:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:27] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:46:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] restbase: reprovision restbase1017 [puppet] - https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:51:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "do not merge until we use hiera 3.x in production as well." [puppet] - https://gerrit.wikimedia.org/r/402346 (owner: Giuseppe Lavagetto)
[09:52:54] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807)
[09:55:42] <wikibugs>	 (CR) Volans: "Few minor comments inline." (6 comments) [puppet] - https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: Thcipriani)
[09:55:44] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:57:09] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:57:19] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:58:25] <elukey>	 !log rolling reboots of aqs hosts (1005->1009) for kernel updates 
[09:58:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:48] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T162807 (duration: 01m 09s)
[09:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:01] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[10:06:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Please just limit your work to adding the new servers. Reordering can be done in a logical way at a later time." [puppet] - https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[10:06:26] <wikibugs>	 (CR) Ema: [C: 2] vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - https://gerrit.wikimedia.org/r/398311 (owner: Ema)
[10:06:33] <wikibugs>	 (PS5) Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - https://gerrit.wikimedia.org/r/398311
[10:06:36] <wikibugs>	 (CR) Ema: [V: 2 C: 2] vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - https://gerrit.wikimedia.org/r/398311 (owner: Ema)
[10:08:27] <wikibugs>	 Operations, monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900022 (fgiunchedi) Given that it isn't that many metrics, I think it might be simpler to keep the standard jmx exporter configuration on the puppetdb side and drop the metri...
[10:12:16] <wikibugs>	 Operations, monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900057 (elukey) >>! In T184796#3900022, @fgiunchedi wrote: > Given that it isn't that many metrics, I think it might be simpler to keep the standard jmx exporter configuratio...
[10:12:45] <icinga-wm>	 PROBLEM - proxysql processes on terbium is CRITICAL: PROCS CRITICAL: 0 processes with command name proxysql
[10:12:55] <wikibugs>	 (PS4) Elukey: site.pp: add mw1338->48 [puppet] - https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519)
[10:15:46] <moritzm>	 !log reboot wasat for kernel security update
[10:15:51] <jynus>	 proxysql probably doesn't start automatically after restart
[10:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:02] <jynus>	 as it is a test install
[10:16:36] <jynus>	 !log start proxysql on terbium
[10:16:45] <icinga-wm>	 RECOVERY - proxysql processes on terbium is OK: PROCS OK: 1 process with command name proxysql
[10:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:10] <wikibugs>	 (PS1) Jcrespo: mariadb: switchover s2 codfw master from db2017 to db2035 [puppet] - https://gerrit.wikimedia.org/r/404268 (https://phabricator.wikimedia.org/T176243)
[10:20:25] <wikibugs>	 (PS1) Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796)
[10:20:44] <wikibugs>	 (PS1) Jcrespo: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243)
[10:20:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] site.pp: add mw1338->48 [puppet] - https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[10:22:50] <jynus>	 !log starting codfw s2 master switchover
[10:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:15] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: switchover s2 codfw master from db2017 to db2035 [puppet] - https://gerrit.wikimedia.org/r/404268 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[10:27:51] <jynus>	 !log upgrade and restart db2035
[10:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "See inline, LGTM generally" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: Dzahn)
[10:37:46] <wikibugs>	 (PS9) Ema: Smarter Varnish slow log [puppet] - https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: Gilles)
[10:38:00] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[10:38:32] <wikibugs>	 (CR) Ema: [C: 2] Smarter Varnish slow log [puppet] - https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: Gilles)
[10:39:54] <wikibugs>	 (Merged) jenkins-bot: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[10:40:07] <wikibugs>	 (CR) jenkins-bot: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[10:40:48] <wikibugs>	 (PS2) Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796)
[10:44:12] <icinga-wm>	 RECOVERY - HHVM processes on labweb1001 is OK: PROCS OK: 6 processes with command name hhvm
[10:46:41] <moritzm>	 ^labweb is me
[10:48:47] <hashar>	 !log Upgrading zuul to 2.5.1 on contint1001 / contint2001
[10:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:01] <wikibugs>	 (03PS3) 10Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796)
[10:50:03] <elukey>	 !log reboot kafka2002 for kernel updates
[10:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] role::prometheus::ops: add puppetdb metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey)
[10:51:02] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Switchover codfw s2 master from db2017 to db2035 (duration: 01m 12s)
[10:51:03] <hashar>	 !log Upgrading zuul to 2.5.1 on contint1001 / contint2001 | T158243
[10:51:03] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274
[10:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:23] <stashbot>	 T158243: Update zuul to upstream master - https://phabricator.wikimedia.org/T158243
[10:51:27] <jynus>	 !log s2 codfw master switchover finished
[10:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:57] <gehel>	 !log lowering disk watermark on elasticsearch eqiad to shuffle shards around
[10:52:58] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui)
[10:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:33] <icinga-wm>	 PROBLEM - DPKG on graphite1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:54:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui)
[10:55:41] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T162807 (duration: 01m 12s)
[10:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:51] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[10:56:38] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546)
[10:56:47] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui)
[10:57:58] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276
[10:58:01] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276
[11:00:04] <jouncebot>	 jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:48] <wikibugs>	 (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[11:02:17] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[11:02:29] <wikibugs>	 (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[11:04:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: don't consider sde for restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404277 (https://phabricator.wikimedia.org/T184100)
[11:04:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] restbase: don't consider sde for restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404277 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[11:08:06] <godog>	 !log bootstrap cassandra-a on restbase1017 
[11:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:41] <logmsgbot>	 !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 01m 14s)
[11:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:53] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:10:56] <logmsgbot>	 !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 01m 14s)
[11:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:10] <icinga-wm>	 RECOVERY - DPKG on restbase1011 is OK: All packages OK
[11:11:10] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[11:11:20] <icinga-wm>	 RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[11:11:20] <icinga-wm>	 RECOVERY - dhclient process on restbase1011 is OK: PROCS OK: 0 processes with command name dhclient
[11:11:21] <icinga-wm>	 RECOVERY - Check size of conntrack table on restbase1011 is OK: OK: nf_conntrack is 9 % full
[11:11:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[11:11:30] <icinga-wm>	 RECOVERY - configured eth on restbase1011 is OK: OK - interfaces up
[11:11:30] <icinga-wm>	 RECOVERY - MD RAID on restbase1011 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0
[11:11:40] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on restbase1011 is OK: OK ferm input default policy is set
[11:11:44] <godog>	 that's me ^
[11:11:50] <icinga-wm>	 RECOVERY - Disk space on restbase1011 is OK: DISK OK
[11:11:50] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1011 is OK: OK - cassandra-a is active
[11:11:50] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active
[11:16:40] <icinga-wm>	 RECOVERY - DPKG on graphite1002 is OK: All packages OK
[11:18:51] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on restbase1011 is OK: OK: synced at Mon 2018-01-15 11:18:49 UTC.
[11:21:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Scap: bump version to 3.7.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/404219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[11:21:23] <godog>	 !log upload scap 3.7.6-1 - T127762
[11:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:39] <stashbot>	 T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762
[11:21:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Scap: bump version to 3.7.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/404219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[11:22:30] <icinga-wm>	 RECOVERY - IPMI Sensor Status on restbase1011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[11:26:14] <wikibugs>	 (03PS1) 10Ema: varnishslowlog: do not crash on empty reqheader values [puppet] - 10https://gerrit.wikimedia.org/r/404279
[11:29:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey)
[11:29:32] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey)
[11:30:10] <wikibugs>	 (03CR) 10Ema: [C: 032] varnishslowlog: do not crash on empty reqheader values [puppet] - 10https://gerrit.wikimedia.org/r/404279 (owner: 10Ema)
[11:31:11] <_joe_>	 grr
[11:31:21] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey)
[11:32:16] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900253 (10elukey) Started a dashboard in https://grafana-admin.wikimedia.org/dashboard/db/puppetdb
[11:32:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: kubernetes: Set IPv6 accept_ra to 2 [puppet] - 10https://gerrit.wikimedia.org/r/404281
[11:33:23] <wikibugs>	 (03PS3) 10Volans: wmf-auto-reimage: improve resume capabilities [puppet] - 10https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702)
[11:34:43] <wikibugs>	 (03CR) 10Volans: [C: 032] wmf-auto-reimage: improve resume capabilities [puppet] - 10https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans)
[11:36:47] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw13(3[8-9]|4[0-9]).*
[11:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "That's not the only issue, still investigating" [puppet] - 10https://gerrit.wikimedia.org/r/404281 (owner: 10Alexandros Kosiaris)
[11:42:20] <icinga-wm>	 PROBLEM - MegaRAID on labsdb1003 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[11:43:28] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900277 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1338.eqiad.wmnet ``` The log can be fo...
[11:45:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-gl: Cleanup [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/403116 (owner: 10KartikMistry)
[11:45:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-ca: Cleanup [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/403109 (owner: 10KartikMistry)
[11:45:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-es: Cleanup [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/403107 (owner: 10KartikMistry)
[11:45:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-ca: Cleanup [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/403106 (owner: 10KartikMistry)
[11:48:59] <elukey>	 !log reboot aqs1007 for kernel upgrades
[11:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:19] <elukey>	 ah already logged a rolling reboot, amending the sal
[11:51:14] <wikibugs>	 (03PS1) 10Ema: varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282
[11:51:43] <wikibugs>	 (03CR) 10Gilles: [C: 031] varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282 (owner: 10Ema)
[11:51:55] <wikibugs>	 10Operations, 10Parsoid, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3900319 (10Deskana)
[11:52:14] <wikibugs>	 (03CR) 10Ema: [C: 032] varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282 (owner: 10Ema)
[11:54:11] <moritzm>	 !log rebooting ores1* for kernel security update
[11:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:37] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui)
[11:56:03] <wikibugs>	 (03CR) 10Paladox: [C: 031] "You will want to update to gerrit 2.14.7 :)." [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad)
[11:57:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui)
[11:57:12] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui)
[11:58:59] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on labsdb1003 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Marostegui Will be decommissioned soon https://phabricator.wikimedia.org/T184832
[11:59:43] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 01m 12s)
[11:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:54] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[12:04:47] <jynus>	 !log upgrade and restart db2017
[12:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:40] <icinga-wm>	 PROBLEM - proxysql processes on wasat is CRITICAL: PROCS CRITICAL: 0 processes with command name proxysql
[12:14:40] <icinga-wm>	 RECOVERY - proxysql processes on wasat is OK: PROCS OK: 1 process with command name proxysql
[12:20:49] <wikibugs>	 (03PS1) 10Hashar: Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283
[12:20:51] <wikibugs>	 (03PS1) 10Hashar: Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338)
[12:25:52] <wikibugs>	 (03CR) 10Hashar: [C: 032] Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar)
[12:26:13] <wikibugs>	 (03CR) 10Hashar: [C: 032] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar)
[12:26:29] <wikibugs>	 (03CR) 10Hashar: [C: 032] Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar)
[12:26:34] <wikibugs>	 (03CR) 10Hashar: [C: 04-2] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar)
[12:30:43] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507)
[12:33:09] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3900380 (10akosiaris) >>! In T178690#3891946, @Dzahn wrote: > We need the following new dashboards / URLs (noticed as part of T183873): >  > - service cluster A overview (...
[12:33:42] <icinga-wm>	 PROBLEM - MD RAID on mw1338 is CRITICAL: Return code of 255 is out of bounds
[12:35:47] <volans>	 new host, silencing
[12:51:03] <icinga-wm>	 PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:52:47] <wikibugs>	 (03PS1) 10Gilles: Fix varnishslowlog logstash configuration [puppet] - 10https://gerrit.wikimedia.org/r/404288
[13:20:19] <elukey>	 !log reboot kafka2003 for kernel upgrades
[13:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:03] <icinga-wm>	 RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:24:00] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900465 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1338.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1338.eqiad.wmnet'] ```
[13:26:36] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900473 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1338.eqiad.wmnet ``` The log can be fo...
[13:29:50] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3900477 (10Marostegui) Maybe we should force this host to be WB even without the BBU to make sure it catches up: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2F...
[13:30:31] <wikibugs>	 (03CR) 10Hashar: [C: 032] "recheck" [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar)
[13:31:36] <wikibugs>	 10Operations, 10Cloud-Services, 10cloud-services-team: labvirt1021-1022 spam the dhcp server with requests - https://phabricator.wikimedia.org/T184909#3900478 (10Joe)
[13:31:38] <wikibugs>	 (03Merged) 10jenkins-bot: Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar)
[13:35:23] <icinga-wm>	 PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:36:13] <icinga-wm>	 RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 79216 bytes in 0.296 second response time
[13:38:00] <wikibugs>	 (03CR) 10Hashar: [C: 04-2] "recheck" [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar)
[13:38:48] <elukey>	 !log reboot eventlog1001 for kernel updates 
[13:38:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar)
[13:41:43] <gehel>	 !log starting rolling reboot of elasticsearch / cirrus eqiad for kernel upgrade
[13:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:02] <icinga-wm>	 RECOVERY - Long running screen/tmux on restbase1011 is OK: OK: No SCREEN or tmux processes detected.
[13:45:12] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900500 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['mw1339.eqiad.wmnet', 'mw1340.eqiad.wmn...
[13:52:04] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10MoritzMuehlenhoff)
[13:52:26] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900512 (10MoritzMuehlenhoff)
[13:55:18] <zeljkof>	 hm, mwlog1001 says: 462 data error in /srv/mediawiki/php-1.31.0-wmf.15/extensions/Graph/includes/ApiGraph.php on line 125
[13:56:29] <wikibugs>	 (03CR) 10Ema: [C: 032] Fix varnishslowlog logstash configuration [puppet] - 10https://gerrit.wikimedia.org/r/404288 (owner: 10Gilles)
[13:59:02] <wikibugs>	 (03PS1) 10Ladsgroup: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1400).
[14:00:04] <jouncebot>	 kart_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:10] <zeljkof>	 o/
[14:00:16] <zeljkof>	 I can SWAT today
[14:00:23] <wikibugs>	 (03PS2) 10Hashar: Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338)
[14:00:32] <zeljkof>	 kart_: around for SWAT?
[14:00:48] <Amir1>	 zeljkof: I added one thing to the swat, not testable :D
[14:01:37] <zeljkof>	 Amir1: looking...
[14:01:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar)
[14:02:42] <zeljkof>	 Amir1: ok, so I deploy the patch, no mwdebug?
[14:02:54] <zeljkof>	 I can do it while waiting for kart_ to come :)
[14:02:59] <Amir1>	 yup
[14:03:06] <zeljkof>	 Amir1: or do you want to deploy yourself?
[14:03:25] <Amir1>	 zeljkof: I can :)
[14:03:36] <zeljkof>	 Amir1: please do then, go ahead
[14:03:44] <Amir1>	 coool
[14:03:47] <zeljkof>	 I'll deploy kart_'s commit when he comes
[14:04:10] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup)
[14:05:35] <moritzm>	 !log reboot rdb* hosts in codfw for kernel security update
[14:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:50] <wikibugs>	 (03Merged) 10jenkins-bot: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup)
[14:06:54] <wikibugs>	 (03CR) 10jenkins-bot: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup)
[14:07:26] <kart_>	 zeljkof: here now.
[14:07:38] <kart_>	 Sorry for delay.
[14:07:50] <zeljkof>	 kart_: no problem, Amir1 is deploying, you are next, in a few minutes
[14:08:06] <Amir1>	 Mine is about to finish 
[14:08:46] <zeljkof>	 kart_: +2d 404070, waiting for CI, will ping you when the commit is at mwdebug1002 (in a few minutes)
[14:08:53] <logmsgbot>	 !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404298|Enable lua fine grained usage tracking in some wikis (T184322)]] (duration: 01m 14s)
[14:08:58] <kart_>	 OK!
[14:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:04] <stashbot>	 T184322: Enable fine grained lua tracking gradually in client wikis - https://phabricator.wikimedia.org/T184322
[14:09:29] <Amir1>	 Mine is done, just need to monitor it from now on
[14:09:50] <zeljkof>	 Amir1: ok, thanks, I will take over SWAT then
[14:10:05] <zeljkof>	 kart_: forgot to ask, do you want to deploy your commit yourself?
[14:10:39] <kart_>	 zeljkof: no. go ahead :D
[14:11:18] <zeljkof>	 kart_: sure, just asking, if you would like to deploy in the future, let me know, it is not black magic :)
[14:12:07] <wikibugs>	 10Operations, 10Cloud-Services, 10cloud-services-team: labvirt1021-1022 spam the dhcp server with requests - https://phabricator.wikimedia.org/T184909#3900561 (10aborrero) p:05Triage>03Normal
[14:12:27] <kart_>	 zeljkof: yes. I know! Just a bit noisy here so I don't want to mess something up.
[14:13:20] <zeljkof>	 kart_: no problem, that's what #releng is for :) we mess up things regardless of the noise
[14:13:47] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900568 (10aborrero) p:05Triage>03High
[14:14:47] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10aborrero) Should this be merged somehow into T184189 ?
[14:14:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/404053 (owner: 10Dzahn)
[14:15:58] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875329 (10aborrero) @MoritzMuehlenhoff reports in T184910 that there are servers just pending the reboot. Should that t...
[14:19:57] <zeljkof>	 kart_: the commit is at mwdebug1002, any estimate of how much time you need to test it?
[14:20:30] <kart_>	 zeljkof: 2 or 3 min.
[14:20:46] <zeljkof>	 kart_: great
[14:20:53] <zeljkof>	 let me know if I can deploy
[14:23:49] <kart_>	 zeljkof: testwiki is wmf16, right?
[14:24:03] <zeljkof>	 let me check...
[14:24:43] <zeljkof>	 https://tools.wmflabs.org/versions/ says 1.31.0-wmf.16
[14:25:16] <kart_>	 ah. Checking again.
[14:26:16] <kart_>	 zeljkof: did you sync all files?
[14:26:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100)
[14:27:07] <zeljkof>	 kart_: I ran `scap pull` at mwdebug1002, if that's what you are asking
[14:27:44] <kart_>	 okay. Wondering why patch has no effect yet.
[14:27:48] <zeljkof>	 the commit should be only at mwdebug1002, if that was the question
[14:27:59] <zeljkof>	 I did not deploy anywhere else yet
[14:28:29] <zeljkof>	 did you use the x-wikimedia-debug extension to test at mwdebug1002?
[14:29:26] <kart_>	 zeljkof: yes. I use that.
[14:29:31] <kart_>	 as usual.
[14:30:42] <zeljkof>	 kart_: I have just checked, I have run all the commands, the commit should be at mwdebug1002, I can't find any mistake I could have made
[14:31:06] <kart_>	 zeljkof: OK. Then let me try again, if that doesn't work, we will abandon patch.
[14:31:16] <kart_>	 It is not working as expected.
[14:31:34] <zeljkof>	 kart_: should I revert the patch?
[14:31:46] <zeljkof>	 (since it is already merged)
[14:32:04] <kart_>	 zeljkof: wait. Checking with Nikerabbit too.
[14:32:24] <kart_>	 zeljkof: no other patches to SWAT, right?
[14:32:34] <kart_>	 So, we can take some time to debug :)
[14:32:37] <icinga-wm>	 PROBLEM - DPKG on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:32:37] <icinga-wm>	 PROBLEM - dhclient process on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:32:51] <zeljkof>	 kart_: this is the only patch left, so there is time :)
[14:34:27] <icinga-wm>	 PROBLEM - Disk space on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:34:27] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1340 is CRITICAL: Host mw1340 is not in mediawiki-installation dsh group
[14:36:07] <icinga-wm>	 PROBLEM - HHVM processes on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:36:07] <icinga-wm>	 PROBLEM - nutcracker port on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:37:28] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:37:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1340 is CRITICAL: connect to address 10.64.32.52 and port 80: Connection refused
[14:37:47] <icinga-wm>	 PROBLEM - nutcracker process on mw1340 is CRITICAL: Return code of 255 is out of bounds
[14:38:18] <volans>	 mw1340 is a new host, silencing
[14:38:23] <kart_>	 zeljkof: we're good.
[14:38:30] <kart_>	 zeljkof: cache is the culprit.
[14:39:04] <zeljkof>	 kart_: ok to deploy?
[14:39:19] <kart_>	 yes.
[14:39:30] <zeljkof>	 kart_: deploying
[14:40:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "This actually doesn't compile, see https://puppet-compiler.wmflabs.org/compiler02/9725/" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[14:40:42] <logmsgbot>	 !log zfilipin@tin Synchronized php-1.31.0-wmf.16/extensions/ContentTranslation: SWAT: [[gerrit:404070|CX1: Fix translation view UI overlaps (T184662 T184130)]] (duration: 01m 16s)
[14:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:55] <stashbot>	 T184130: [wmf.15-regression] ContentTranslation page: the additional message not displayed correctly  - https://phabricator.wikimedia.org/T184130
[14:40:55] <stashbot>	 T184662: [wmf.16 - regression] Cannot  click Personal draft button - https://phabricator.wikimedia.org/T184662
[14:41:18] <zeljkof>	 kart_: deployed! please check and thanks for deploying with #releng! ;)
[14:41:33] <kart_>	 zeljkof: Thanks a lot again!
[14:41:45] <zeljkof>	 !log EU SWAT finished
[14:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100)
[14:46:17] <wikibugs>	 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900673 (10akosiaris)
[14:46:28] <jynus>	 !log upgrade and restart db2036
[14:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:02] <jynus>	 lots of test wiki errors
[14:48:18] <jynus>	 at 14:25
[14:48:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo)
[14:48:38] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507)
[14:51:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[14:51:57] <wikibugs>	 (03PS3) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100)
[14:52:16] <wikibugs>	 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900700 (10akosiaris)
[14:52:31] <wikibugs>	 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900713 (10akosiaris)
[14:52:33] <wikibugs>	 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900712 (10akosiaris)
[14:54:49] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507)
[14:57:33] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507)
[14:57:39] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo)
[14:58:05] <jynus>	 !log upgrade and restart db2043
[14:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:10] <icinga-wm>	 PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[cassandra-c]
[15:01:49] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1017 is CRITICAL: NRPE: Command check_cassandra-c-state not defined
[15:02:31] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Move db2050 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404302 (https://phabricator.wikimedia.org/T148507)
[15:07:48] <godog>	 yes yes
[15:08:10] <jynus>	 !log upgrade and restart db2050
[15:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:44] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Move db2050 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404302 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo)
[15:11:59] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.131 and port 9042: Connection refused
[15:15:29] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1017 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[15:17:09] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.132 and port 9042: Connection refused
[15:18:50] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:19:29] <wikibugs>	 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900773 (10akosiaris)
[15:22:00] <wikibugs>	 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900786 (10akosiaris)
[15:22:31] <wikibugs>	 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900799 (10akosiaris)
[15:22:33] <wikibugs>	 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900798 (10akosiaris)
[15:23:21] <wikibugs>	 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3883571 (10akosiaris)
[15:23:23] <wikibugs>	 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900800 (10akosiaris)
[15:23:31] <wikibugs>	 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900773 (10akosiaris) p:05Triage>03Normal
[15:23:39] <wikibugs>	 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3883571 (10akosiaris) p:05Triage>03Normal
[15:23:49] <wikibugs>	 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900700 (10akosiaris) p:05Triage>03Normal
[15:23:58] <wikibugs>	 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900786 (10akosiaris) p:05Triage>03Normal
[15:25:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::base: configure apt before installing any package [puppet] - 10https://gerrit.wikimedia.org/r/404304
[15:25:58] <_joe_>	 volans: ^^
[15:31:05] <wikibugs>	 (03CR) 10Faidon Liambotis: apt: unattended-upgrades: add targetted upgrades script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[15:33:30] <jynus>	 !log upgrade and restart db2057
[15:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::base: run the apt configuration before anything else [puppet] - 10https://gerrit.wikimedia.org/r/404305
[15:41:33] <paravoid>	 _joe_: I think it's like the 3rd time this has been proposed :P
[15:41:43] <paravoid>	 I think you've proposed it before too!
[15:41:45] <paravoid>	 it won't work
[15:42:33] <_joe_>	 paravoid: in reality, it can work with some grease around the wheels, and we're having a pretty serious bug during the first installation we need to fix
[15:42:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900879 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1338.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1338.eqiad.wmnet'] ```
[15:42:56] <paravoid>	 not sure what you mean with grease
[15:43:00] <_joe_>	 paravoid: in the past I asked why we weren't doing it :)
[15:43:01] <paravoid>	 but as it is, it will just loop
[15:43:10] <paravoid>	 dependency loops
[15:43:38] <_joe_>	 I mean separating concerns between what needs to be configured before we try to download any package, and what can be configured just before a specific package
[15:44:48] <_joe_>	 else we'll be unable to download packages from security.debian.org for most of the first puppet run
[15:44:56] <jynus>	 !log upgrade and restart db2074
[15:45:04] <paravoid>	 not sure why?
[15:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:52] <volans>	 paravoid: the apt::conf and apt::pin in class apt() are not executed at the start as they should be, but way down in the first puppet run
[15:46:03] <paravoid>	 but security.d.o is being set up by d-i
[15:46:04] <_joe_>	 paravoid: the package resource for the provider "apt" apparently depends on /etc/apt/apt.conf in puppet 4
[15:46:09] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[15:46:12] <_joe_>	 so it gets removed early in the installation
[15:46:23] <_joe_>	 and we don't have the proxy config anymore
[15:46:24] <akosiaris>	 mr1 down ?
[15:46:32] <_joe_>	 seems so, just ipv6
[15:46:39] <paravoid>	 (and just OOB IPv6 at that)
[15:46:44] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807)
[15:46:47] <paravoid>	 what gets removed and what removes it?
[15:47:29] <_joe_>	 paravoid: /etc/apt/apt.conf
[15:47:29] <akosiaris>	 indeed OOB only but not just IPv6, it's IPv4 as well
[15:47:42] <_joe_>	 which contains the proxy setting
[15:47:55] <_joe_>	 as for what removes it, it is a file resource inside the apt class
[15:48:19] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:20] <_joe_>	 now I think something is made to depend on it, and I think it's the package resource with the debian provider
[15:48:27] <_joe_>	 but I have to confirm by reading the sources
[15:48:51] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[15:49:08] <paravoid>	 we basically just need to order File['/etc/apt/apt.conf'] -> Apt::Conf (spaceship or all resources) as I see it
[15:49:23] <_joe_>	 the other way around maybe?
[15:49:33] <_joe_>	 you want the other confs to happen before you remove it
[15:49:40] <volans>	 we have an ensure absent for /etc/apt/apt.conf
[15:49:42] <_joe_>	 and yes, that was my third solution
[15:50:11] <paravoid>	 https://gerrit.wikimedia.org/r/#/c/167835/
[15:50:24] <paravoid>	 https://gerrit.wikimedia.org/r/#/c/169643/
[15:50:37] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[15:50:42] <_joe_>	 but that might cause loops too
[15:50:44] <paravoid>	 and https://gerrit.wikimedia.org/r/#/c/179082/
[15:50:48] <_joe_>	 I'll test that next
[15:51:09] <_joe_>	 the last one I remember :)
[15:51:19] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms
[15:51:22] <paravoid>	 but your patch is equivalent to https://gerrit.wikimedia.org/r/#/c/167835/ and won't work :(
[15:51:35] <_joe_>	 paravoid: yeah we're testing things with volans 
[15:51:52] <_joe_>	 I'll try the apt::conf / apt::pin spaceships dependencies
[15:51:57] <paravoid>	 I think ordering/dependency-wise we remove apt.conf, set up apt.conf.d, but don't ensure that apt doesn't run between those two steps
[15:51:59] <_joe_>	 with /etc/apt/apt.conf
[15:52:49] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 and db1089 - T162807 (duration: 01m 12s)
[15:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:01] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[15:53:30] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms
[15:53:49] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 04-2] "Equivalent to I481bc29ba5f0b6fef8c61d16e9d1b5e1cfeb0c55, which got reverted by Iecc000fd0c93a428af9c9e8ea2aefa0dbe03313d because it was ca" [puppet] - 10https://gerrit.wikimedia.org/r/404305 (owner: 10Giuseppe Lavagetto)
[15:53:53] <paravoid>	 I left a -2 JIC
[15:54:09] <paravoid>	 if you re-use the changeset for a different approach I can remove
[15:54:26] <paravoid>	 just leaving it there because as-is this will break all package installs :)
[15:54:35] <volans>	 yeah we know ;)
[15:54:36] <_joe_>	 as-is will break puppet
[15:54:39] <_joe_>	 plain and simple
[15:54:46] <paravoid>	 :)
[15:54:48] <volans>	 as is it causes
[15:54:49] <volans>	 Exec[apt-get update] => Class[Apt] => Stage[apt-config] => Stage[main] => Class[Base::Puppet::Puppet4] => Apt::Pin[puppet-all] => File[/etc/apt/preferences.d/puppet_all.pref] => Exec[apt-get update]
[15:54:51] <paravoid>	 yup
[15:54:59] <paravoid>	 or a number of other loops really
[15:55:14] <paravoid>	 just explaining my -2 :)
[15:55:47] <marostegui>	 !log Stop replication in sync db1067 and db1089 - T162807
[15:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:42] <_joe_>	 paravoid: we're basically exploring approaches to find the best one
[15:59:09] <_joe_>	 because at the moment the appservers are borderline uninstallable without a fix
[15:59:47] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: apt: make apt::conf and apt::pin configs happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307
[16:03:00] <paravoid>	 yeah that should work
[16:03:08] <paravoid>	 could add "before" statements to those apt::confs below
[16:03:22] <paravoid>	 not sure if the pin is needed?
[16:04:02] <_joe_>	 paravoid: it's not /strictly/ needed
[16:04:43] <_joe_>	 but in general this should force puppet 4 to install both apt::pin and apt::conf directives *before* removing the apt.conf file
[16:04:52] <_joe_>	 which should happen before puppet installs any package
[16:04:58] <paravoid>	 not necessarily
[16:05:17] <paravoid>	 the order between these and Packages isn't guaranteed, but do we care?
[16:05:38] <volans>	 as long as apt.conf is ~ what puppet configures, probably not
[16:05:51] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150)
[16:05:53] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: grafana: Hieraize parameters [puppet] - 10https://gerrit.wikimedia.org/r/404309 (https://phabricator.wikimedia.org/T170150)
[16:06:07] <_joe_>	 it kinda is guaranteed, more or less
[16:06:32] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[16:06:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[16:11:59] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243)
[16:13:02] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Deprecate passwords::grafana::labs [labs/private] - 10https://gerrit.wikimedia.org/r/404311 (https://phabricator.wikimedia.org/T170150)
[16:13:34] <wikibugs>	 (03PS1) 10Aklapper: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426)
[16:15:02] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243)
[16:18:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Deprecate passwords::grafana::labs [labs/private] - 10https://gerrit.wikimedia.org/r/404311 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[16:18:46] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:19:23] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:20:56] <jynus>	 !log starting codfw s3 master switchover
[16:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:23:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[16:23:26] <wikibugs>	 (03PS1) 10Gehel: wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315
[16:23:49] <icinga-wm>	 PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-blazegraph-exporter]
[16:24:05] <jynus>	 !log restarting db2036 to set as master
[16:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:39] <gehel>	 godog: ^^^ did you deploy a new blazegraph exporter ?
[16:25:16] <godog>	 gehel: I did yeah, but doesn't work as expected because blazegraph.service isn't a thing
[16:25:48] <gehel>	 godog: :)  yeah, it is wdqs-blazegraph...
[16:26:16] <gehel>	 one more reason to have a systemd override in puppet ...
[16:27:11] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150)
[16:27:13] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: grafana: Hieraize parameters [puppet] - 10https://gerrit.wikimedia.org/r/404309 (https://phabricator.wikimedia.org/T170150)
[16:27:15] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[16:27:19] <godog>	 indeed
[16:27:57] <wikibugs>	 (03CR) 10Gehel: "puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler03/9731/wdqs1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel)
[16:28:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[16:29:28] <akosiaris>	 sigh jerkins-bot
[16:29:43] <akosiaris>	 I guess I can fold the 2 changes... that should resolve it
[16:30:10] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307
[16:30:37] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150)
[16:30:39] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[16:31:55] <marostegui>	 !log Force WB on db2033 - T184888
[16:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:07] <stashbot>	 T184888: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888
[16:32:43] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[16:34:29] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.30, 35.69, 31.82
[16:35:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Don't depend on blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434)
[16:35:19] <_joe_>	 can someone look at 1227?
[16:35:24] <_joe_>	 I'm doing something else
[16:36:50] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM. Runs ok on my puppetmaster+client. It still means that we basically depends on the provided apt.conf in the installer image, and we " [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto)
[16:38:02] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3900967 (10Marostegui) The server kept lagging. I have forced the controller to go to WriteBack temporarily till we decide how to proceed with this host. ``` root@db2033:~# hpssacli controller all s...
[16:39:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "This is by no means a definitive solution, it's just part of the solution for the contingent problem we're trying to solve." [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto)
[16:39:10] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307
[16:41:03] <_joe_>	 !log restarting hhvm on mw1227, threads stuck in HPHP::jit::enterTCImpl
[16:41:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto)
[16:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:40] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM" [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi)
[16:43:12] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:43:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Don't depend on blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi)
[16:43:34] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317
[16:43:46] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for s3 codfw switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui)
[16:44:55] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Deprecate passwords::grafana::production [labs/private] - 10https://gerrit.wikimedia.org/r/404318
[16:45:35] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:46:03] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317
[16:47:08] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo)
[16:49:07] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Switchover s3 codfw master from db2018 to db2036 (duration: 01m 12s)
[16:49:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:29] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.43, 18.74, 24.00
[16:49:44] <jynus>	 !log finished codfw s3 master switchover
[16:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui)
[16:50:18] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#3901003 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Done, fix deployed
[16:50:54] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150)
[16:51:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui)
[16:51:57] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui)
[16:52:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Deprecate passwords::grafana::production [labs/private] - 10https://gerrit.wikimedia.org/r/404318 (owner: 10Alexandros Kosiaris)
[16:52:22] <wikibugs>	 10Operations, 10Datasets-General-or-Unknown, 10Wikidata, 10HHVM, 10Patch-For-Review: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3901016 (10ArielGlenn) Snapshot hosts are going directly to php7/stretch, bypassing this issue. See T181029.
[16:53:08] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 and db1089 - T162807 (duration: 01m 12s)
[16:53:15] <wikibugs>	 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10ayounsi) Port 5 now works. Port 6 doesn't give a shell, but replies some characters on key-press. The other atlas don't seem to be connected to a scs so I can't compare. a few options: 1/ It'...
[16:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:20] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[16:53:38] <jynus>	 !log upgrade and restart db2018
[16:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:00] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is OK: TCP OK - 0.005 second response time on 10.64.32.131 port 9042
[17:01:17] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150)
[17:02:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 42.98 seconds
[17:02:52] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.01 seconds
[17:06:32] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1017 is OK: OK - cassandra-b is active
[17:07:11] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-c valid until 2018-08-17 16:11:33 +0000 (expires in 213 days)
[17:07:48] <wikibugs>	 10Operations, 10ops-esams: install/designate other machines as esams bastion - https://phabricator.wikimedia.org/T184936#3901043 (10Dzahn) p:05Triage>03High
[17:08:17] <mobrovac>	 wow godog, both -b and -c came up at the same time?
[17:08:50] <godog>	 !log bootstrap cassandra-c on restbase1017
[17:09:00] <godog>	 mobrovac: hehhe no I started it when I saw -b completed
[17:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:09] <mobrovac>	 :)
[17:11:11] <icinga-wm>	 RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:11:21] <wikibugs>	 10Operations, 10ops-esams: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#3901055 (10Dzahn)
[17:17:01] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1017 is OK: OK - cassandra-c is active
[17:17:38] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[17:17:40] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150)
[17:17:42] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150)
[17:17:44] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: WIP: grafana: Enable grafana's LDAP [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[17:18:51] <icinga-wm>	 RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:21:10] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3898087 (10jcrespo) As a heads up, this is now the s3 master.
[17:22:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler02/9737/krypton.eqiad.wmnet/ is pretty good. I still need to populate ldap.toml config file in" [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[17:28:19] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T142807)
[17:29:45] <icinga-wm>	 RECOVERY - nutcracker port on mw1340 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[17:29:48] <icinga-wm>	 RECOVERY - HHVM processes on mw1340 is OK: PROCS OK: 6 processes with command name hhvm
[17:30:05] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 2.38, 2.39, 2.34
[17:30:06] <icinga-wm>	 RECOVERY - Disk space on mw1340 is OK: DISK OK
[17:30:06] <icinga-wm>	 RECOVERY - dhclient process on mw1340 is OK: PROCS OK: 0 processes with command name dhclient
[17:30:06] <icinga-wm>	 RECOVERY - DPKG on mw1340 is OK: All packages OK
[17:30:15] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3901109 (10akosiaris) Patchsets above clean up puppetization, drop the ugly distinction of labs vs production from code, moving that into h...
[17:30:25] <icinga-wm>	 RECOVERY - nutcracker process on mw1340 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[17:30:27] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T142807)
[17:30:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469)
[17:31:16] <icinga-wm>	 PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:31:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi)
[17:31:25] <icinga-wm>	 PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:31:56] <icinga-wm>	 PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:32:05] <icinga-wm>	 PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:32:25] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:32:26] <icinga-wm>	 PROBLEM - nutcracker port on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:32:35] <icinga-wm>	 PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:32:35] <icinga-wm>	 PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:32:46] <icinga-wm>	 PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:05] <icinga-wm>	 PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:16] <icinga-wm>	 PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:16] <icinga-wm>	 PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:16] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:16] <icinga-wm>	 PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:22] <jynus>	 what is that? software, puppetmaster?
[17:33:24] <volans>	 akosiaris: you jinxed it
[17:33:26] <icinga-wm>	 PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:26] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:36] <icinga-wm>	 PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:40] <volans>	 jynus: puppetdb killed, restarted by systemd
[17:33:46] <jynus>	 ok cool
[17:33:57] <volans>	 let's see the new dashboard luca put 
[17:34:05] <icinga-wm>	 PROBLEM - nutcracker process on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:34:07] <icinga-wm>	 PROBLEM - DPKG on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:34:07] <icinga-wm>	 PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:34:07] <jynus>	 I mean, not cool, but you get it
[17:34:13] <volans>	 yeah ;)
[17:34:26] <icinga-wm>	 PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:34:26] <icinga-wm>	 PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:34:35] <icinga-wm>	 PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:35:46] <icinga-wm>	 PROBLEM - Disk space on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:35:46] <icinga-wm>	 PROBLEM - puppet last run on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:36:35] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: connect to address 10.64.32.50 and port 9005: Connection refused
[17:37:15] <volans>	 I'm re-running puppet on the failed hosts so that they recover now instead of in 30min, expect some recovery spam ;)
[17:37:46] <icinga-wm>	 RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[17:38:26] <icinga-wm>	 RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[17:39:26] <icinga-wm>	 RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 79239 bytes in 5.401 second response time
[17:41:16] <icinga-wm>	 RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:41:56] <icinga-wm>	 RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:42:05] <icinga-wm>	 RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:42:15] <icinga-wm>	 PROBLEM - nutcracker process on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:42:16] <icinga-wm>	 PROBLEM - DPKG on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:42:26] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1338 is OK: OK ferm input default policy is set
[17:42:26] <icinga-wm>	 PROBLEM - nutcracker port on mw1338 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[17:42:35] <icinga-wm>	 RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:42:35] <icinga-wm>	 RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[17:42:56] <icinga-wm>	 RECOVERY - Disk space on mw1338 is OK: DISK OK
[17:43:05] <icinga-wm>	 RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:15] <icinga-wm>	 RECOVERY - DPKG on mw1338 is OK: All packages OK
[17:43:16] <icinga-wm>	 RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:43:16] <icinga-wm>	 RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:43:16] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:43:16] <icinga-wm>	 RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:26] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:43:35] <icinga-wm>	 RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:44:06] <icinga-wm>	 RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:44:26] <icinga-wm>	 RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:44:35] <icinga-wm>	 RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:44:47] <moritzm>	 !log updating HHVM in deployment-prep to HHVM 3.18.7
[17:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:04] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901174 (10fgiunchedi)
[17:46:25] <icinga-wm>	 RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:35] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[17:48:54] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (10fgiunchedi) p:05Triage>03Normal
[17:49:47] <wikibugs>	 10Operations, 10Goal, 10Technical-Debt, 10User-fgiunchedi: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195#3901194 (10fgiunchedi)
[17:49:49] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901192 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi
[17:50:31] <icinga-wm>	 RECOVERY - nutcracker port on mw1338 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[17:55:50] <icinga-wm>	 RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:56:01] <icinga-wm>	 RECOVERY - nutcracker process on mw1338 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[17:57:17] <wikibugs>	 (03PS5) 10Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215)
[18:00:04] <jouncebot>	 gehel: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:32] <gehel>	 jouncebot: nothing to deploy today...
[18:01:56] <moritzm>	 !log uploading HHVM 3.18.7 (3.18.5+dfsg-1+wmf3) for jessie-wikimedia to apt.wikimedia.org
[18:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100)
[18:03:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[18:04:25] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[18:06:54] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[18:09:34] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms
[18:12:05] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[18:13:33] <wikibugs>	 10Operations, 10ops-esams: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#3901246 (10Dzahn) godog points out that we need to copy prometheus performance data from one host to another and that we should write an updated process how to do that
[18:19:27] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2036 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:6 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184946
[18:19:31] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184946#3901253 (10ops-monitoring-bot)
[18:34:03] <logmsgbot>	 !log oblivian@neodymium conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet
[18:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:36] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.32.132 port 9042
[18:42:39] <logmsgbot>	 !log oblivian@neodymium conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet
[18:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:43] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet
[18:43:52] <_joe_>	 uh?
[18:43:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:58] <_joe_>	 something wrong, sorry
[18:44:44] <_joe_>	 yeah, PEBKAC
[18:50:11] <wikibugs>	 (03Draft2) 10Jayprakash12345: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607)
[18:50:52] <_joe_>	 !log pooled mw1340 as an api appserver
[18:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:14] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3901318 (10Andrew) I powered these off for the moment, just to cut down on dhcp noise.
[18:52:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn)
[19:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:20:59] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3901380 (10faidon) Thanks so much for this, kudos! Any reason to not just 301 grafana-admin to grafana for a few months (and then just drop...
[19:34:26] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1340 is OK: OK
[19:40:13] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184946#3901402 (10Marostegui)
[19:40:15] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3901404 (10Marostegui)
[19:59:44] <wikibugs>	 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3901425 (10ayounsi) Indeed, the instance is not needed anymore. I shut it down and will delete it in a few days.
[21:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T2100).
[21:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[21:00:57] <wikibugs>	 10Operations, 10Discourse, 10Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3901477 (10Qgil) Thank you for your assistance, but it's still not working. https://meta.discourse.org/t/set-up-reply-via-email-support-e-mail/...
[21:17:06] <wikibugs>	 10Operations, 10Discourse, 10Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3901484 (10Tgr) It's under https://myaccount.google.com/apppasswords (a different thing from "apps with access to your account" which is about...
[21:26:26] <icinga-wm>	 PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:17] <icinga-wm>	 RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 79299 bytes in 0.398 second response time
[21:39:38] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: switch network port 2/0/3 (frdb1003) back to administration-vlan - https://phabricator.wikimedia.org/T184723#3901504 (10ayounsi) 05Open>03Resolved a:03ayounsi Done! ``` [edit interfaces interface-range vlan-fundraising] -    member "ge-[0-1]/0/3"; [edit i...
[21:45:59] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3901513 (10zhuyifei1999)
[21:51:43] <wikibugs>	 10Operations, 10DNS, 10Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (10Krenair) wikibooks.wiki too - https://meta.wikimedia.org/wiki/Requests_for_comment/Domain_parking
[21:58:58] <wikibugs>	 (03PS3) 10BryanDavis: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: 10Jcrespo)
[22:00:04] <jouncebot>	 dapatrick, bawolff, and Reedy: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T2200).
[22:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:05:30] <icinga-wm>	 PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:11:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:12:10] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.192 second response time
[23:32:42] <wikibugs>	 (03CR) 10Chad: [C: 04-2] "I don't see anything in the 2.14.7 log thats super important. We're already targeting and testing 2.14.6, let's not move the goalposts." [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad)
[23:36:12] <wikibugs>	 (03CR) 10Chad: [C: 032] Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: 10Aklapper)
[23:37:30] <icinga-wm>	 PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:37:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: 10Aklapper)
[23:37:49] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: 10Aklapper)
[23:38:20] <icinga-wm>	 RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 79170 bytes in 0.312 second response time
[23:40:34] <logmsgbot>	 !log demon@tin Synchronized wmf-config/InitialiseSettings.php: turn educationprogram back on for cs.wikipedia -- turns out there was no consensus and a patch should never have been written 😡 (duration: 01m 13s)
[23:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:47] <wikibugs>	 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3901643 (10Peachey88) p:05Low>03Triage Resetting priority for re-triage by ops on-call.  Redirecting users though a random AWS account when they hit...
[23:43:30] <wikibugs>	 (03CR) 10Chad: "*shrug* Fixed version of scap will go live before this needs another deployment" [software/gerrit] - 10https://gerrit.wikimedia.org/r/404221 (https://phabricator.wikimedia.org/T184882) (owner: 10Paladox)
[23:46:48] <Zppix>	 Hey operations, shinken isn't up
[23:49:26] <no_justification>	 Zppix: Maybe ask cloud services? Production doesn't use it.
[23:49:34] * no_justification goes back to his vacation
[23:50:05] <Zppix>	 no_justification: I should have known that, sorry
[23:52:57] <no_justification>	 Also, most Americans will be off today, it's a federal holiday
[23:53:19] <no_justification>	 s/Americans/people working in US timezones/
[23:53:31] <Zppix>	 Well, I alerted them in the channel just in case they weren't aware :)
[23:53:38] <Zppix>	 Proper*