[00:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161209T0000). Please do the needful.
[00:00:04] <jouncebot>	 kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:17] <yurik>	 see my comment ^^^^ about SWAT
[00:00:35] <ostriches>	 yurik: Has it already been merged to core too
[00:00:35] <ostriches>	 ?
[00:00:41] <ostriches>	 The submodule bump?
[00:00:53] <kaldari>	 here
[00:01:04] <yurik>	 ostriches, it has only been merged to master and wmf5  for the extension
[00:01:09] <yurik>	 we didn't touch core
[00:01:11] <ostriches>	 Needs to go into core too.
[00:01:15] <MaxSem>	 I tried to do the bump, but the result was https://gerrit.wikimedia.org/r/#/c/326060/1
[00:02:09] <ostriches>	 Um...
[00:02:10] <ostriches>	 ok?
[00:02:14] <ostriches>	 #diditwrong
[00:02:15] <ostriches>	 :p
[00:02:20] <ostriches>	 kaldari: Reviewing yours
[00:02:30] <grrrit-wm>	 (PS2) Chad: Temporarily disable centralauth-rename right [mediawiki-config] - https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: Kaldari)
[00:03:35] <paladox>	 ostriches so much easier reverting on stable-2.13, i am building it now
[00:03:49] <grrrit-wm>	 (CR) Chad: [C: 2] Temporarily disable centralauth-rename right [mediawiki-config] - https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: Kaldari)
[00:04:08] <ostriches>	 MaxSem: PS2 looks ok
[00:04:35] <MaxSem>	 arr, I was still looking at ps1
[00:04:37] <grrrit-wm>	 (Merged) jenkins-bot: Temporarily disable centralauth-rename right [mediawiki-config] - https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: Kaldari)
[00:04:40] <MaxSem>	 fucking gerrit
[00:04:50] <ostriches>	 I mean it was in the URL ;p
[00:05:55] <yurik>	 yeah, about a month ago gerrit started remembering older PS number, instead of showing the latest
[00:06:06] <logmsgbot>	 !log demon@tin Synchronized wmf-config/InitialiseSettings.php: swat (duration: 00m 46s)
[00:06:12] <yurik>	 gerrit ppl are really outsmarting themselves with it ;(
[00:06:14] <ostriches>	 kaldari: You're live ^
[00:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:30] <ostriches>	 yurik: Um, it's always done that when you're browsing a specific patch set # :)
[00:06:54] <yurik>	 ostriches, nah, i think it got introduced in the version that added editing beyond comments
[00:06:54] <kaldari>	 ostriches: yay!
[00:06:55] <paladox>	 it should show orange now
[00:07:04] <paladox>	 to tell you you're on an old patch.
[00:08:23] <MaxSem>	 well, I'm colorblind. not that I can't see orange, but I learned to pay little attention to colors
[00:09:31] <paladox>	 It should show on the top right.
[00:09:47] <paladox>	 where you click to change patchsets (bar where you press reply)
[00:16:31] <wikibugs>	 Operations, Discovery, Discovery-Search, Monitoring, Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2858961 (Deskana) >>! In T110171#2858787, @greg wrote: > It's an explicit follow-up from an incident. These should be...
[00:19:44] <MaxSem>	 ostriches, is swat done? can I sync https://gerrit.wikimedia.org/r/#/c/326051/ ?
[00:20:20] <ostriches>	 Yeah was just 1 patch
[00:20:27] <MaxSem>	 thx
[00:22:44] <logmsgbot>	 !log maxsem@tin Synchronized php-1.29.0-wmf.5/extensions/JsonConfig: https://gerrit.wikimedia.org/r/#/c/326051/ (duration: 00m 46s)
[00:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:01] <MaxSem>	 yurik, ^
[00:23:17] <yurik>	 awesome, thanks!!!
[00:23:36] <wikibugs_>	 Puppet: Inconsistent groups for Git repositories with role::puppetmaster::standalone - https://phabricator.wikimedia.org/T152060#2858993 (scfc) p:Triage>Normal a:scfc
[00:31:23] <icinga-wm>	 PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 660 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4750794 keys, up 38 days 16 hours - replication_delay is 660
[00:31:23] <icinga-wm>	 PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4754387 keys, up 38 days 16 hours - replication_delay is 614
[00:31:33] <icinga-wm>	 PROBLEM - Redis status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 4756569 keys, up 38 days 15 hours - replication_delay is 609
[00:32:03] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4752415 keys, up 38 days 16 hours - replication_delay is 650
[00:33:03] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9480054 keys, up 38 days 15 hours - replication_delay is 623
[00:37:33] <icinga-wm>	 RECOVERY - Redis status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 4743919 keys, up 38 days 15 hours - replication_delay is 0
[00:38:23] <icinga-wm>	 RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4738087 keys, up 38 days 16 hours - replication_delay is 0
[00:38:23] <icinga-wm>	 RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4741397 keys, up 38 days 16 hours - replication_delay is 0
[00:39:03] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9445496 keys, up 38 days 15 hours - replication_delay is 0
[00:46:03] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4739597 keys, up 38 days 16 hours - replication_delay is 0
[00:47:33] <paladox>	 ostriches i will try to build tomorrow as i keep getting gc errors
[00:47:37] <paladox>	 out of memory errors
[00:47:51] <paladox>	 i will try and see if i can build it on another host and scp it to there.
[00:47:58] <paladox>	 unless you want to do it.
[00:50:12] <ostriches>	 paladox: No rush on a revert, we can wait a bit
[00:50:16] <ostriches>	 Have a good evening
[00:52:43] <icinga-wm>	 PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:43] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:05:52] <paladox>	 ok
[01:05:54] <paladox>	 thanks
[01:06:00] <paladox>	 ostriches lol it's morning
[01:06:05] <paladox>	 01:05am
[01:07:45] <grrrit-wm>	 (Abandoned) BryanDavis: logstash: dynamically rename object values [puppet] - https://gerrit.wikimedia.org/r/320441 (https://phabricator.wikimedia.org/T150106) (owner: BryanDavis)
[01:08:21] <grrrit-wm>	 (PS3) BryanDavis: l10nupdate: aquire scap lock before changing files [puppet] - https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752)
[01:20:03] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 615 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4740542 keys, up 38 days 16 hours - replication_delay is 615
[01:20:43] <icinga-wm>	 RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[01:32:13] <icinga-wm>	 PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:32:43] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[01:45:19] <paladox>	 ostriches i managed to copy the gerrit folder over to gerrit-test using apache
[01:45:32] <paladox>	 scp, ssh won't work for me, keeps saying something about permission denied.
[01:53:31] <paladox>	 ostriches im deploying it now
[01:54:38] <kaldari>	 !log foreachwiki extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php
[01:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:33] <icinga-wm>	 PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4740160 keys, up 38 days 17 hours - replication_delay is 651
[01:55:33] <icinga-wm>	 PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4743539 keys, up 38 days 17 hours - replication_delay is 648
[01:56:03] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9447606 keys, up 38 days 17 hours - replication_delay is 648
[01:56:23] <icinga-wm>	 RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4735028 keys, up 38 days 17 hours - replication_delay is 0
[01:56:23] <icinga-wm>	 RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4738511 keys, up 38 days 17 hours - replication_delay is 0
[01:59:03] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9442179 keys, up 38 days 17 hours - replication_delay is 0
[01:59:37] <paladox>	 ostriches it's started again with the reverted patch, could you please try logging in again
[02:00:12] <paladox>	 Oh nope
[02:00:13] <icinga-wm>	 RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:00:14] <paladox>	 doesn't work
[02:00:15] <paladox>	 Cannot assign user name "paladox" to account 19; name already in use.
[02:00:20] <paladox>	 when doing Paladox
[02:00:44] <paladox>	 but when i had it all in lower case i could use both
[02:00:48] <paladox>	 but you couldn't log in
[02:01:03] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4736690 keys, up 38 days 17 hours - replication_delay is 0
[02:21:36] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds
[02:30:01] <godog>	 looks like only db1028 is affected on s7
[02:34:36] <godog>	 disk problem maybe? i don't have access atm
[02:38:44] <godog>	 or perhaps a scheduled job, increased activity started at around 1, afaics from grafana
[02:55:53] <icinga-wm>	 PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:03] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[03:10:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.74 seconds
[03:19:02] <wikibugs>	 Operations, Mail, MediaWiki-Email, Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2859207 (Huji) A user at FA WP also just tested it with a Yahoo! sender address and it wor...
[03:22:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 715.12 seconds
[03:24:53] <icinga-wm>	 RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[03:27:13] <icinga-wm>	 PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:29:03] <icinga-wm>	 RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[03:31:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 186.10 seconds
[03:38:10] <godog>	 kaldari: is it possible populateLocalAndGlobalIds.php is spamming centralauth ? we got a slave lag page
[03:38:48] <godog>	 also traffic increased on s7 globally https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s7&var-role=All&from=1481233111803&to=1481254711803
[03:38:55] <kaldari>	 godog: Maybe, I'll kill it for now....
[03:39:48] <godog>	 kaldari: ok thanks! I'll keep an eye on it and see if that was the cause
[03:40:43] <godog>	 only db1028 suffered though, the other slaves didn't have a problem with it
[03:41:01] <kaldari>	 godog: killed it
[03:42:56] <godog>	 yeah written rows are dropping
[03:44:48] <kaldari>	 godog: that's not surprising, all the script does is write a lot of rows, but it's supposed to waitForSlaves after each batch of 1000.
[03:45:42] <kaldari>	 it did about 4 million writes this afternoon before I killed it
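The batching pattern kaldari describes (write 1000 rows, then wait for the replicas before continuing) can be sketched roughly as below. This is only an illustration, not the code of populateLocalAndGlobalIds.php: `write_in_batches`, `write_batch` and `wait_for_replicas` are hypothetical names, and the two callbacks stand in for the real database write and the real replication wait.

```python
from typing import Callable, Iterable, List


def write_in_batches(
    rows: Iterable[object],
    write_batch: Callable[[List[object]], None],
    wait_for_replicas: Callable[[], None],
    batch_size: int = 1000,
) -> int:
    """Write rows in fixed-size batches, pausing for replicas after each one.

    Both callbacks are hypothetical placeholders for the real DB calls.
    Returns the number of batches written.
    """
    batch: List[object] = []
    batches_written = 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            write_batch(batch)
            wait_for_replicas()  # give replicas a chance to catch up
            batches_written += 1
            batch = []
    if batch:  # flush the final, partial batch
        write_batch(batch)
        wait_for_replicas()
        batches_written += 1
    return batches_written
```

The wait only throttles the writer if it actually observes the lagging replica, which is the question godog raises next.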
[03:47:44] <godog>	 kaldari: ah, db1028 is weighted at 0 in mw I wonder if that's related
[03:53:08] <godog>	 I wonder if db1028 will get a chance of catching up on the lag now
[03:53:48] <godog>	 kaldari: anyways thanks for killing it, so far my best lead is db1028 being at weight 0 and waitforslave not waiting for it
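godog's lead here is a hypothesis that the log does not confirm. To make it concrete, the sketch below assumes a replication wait that only looks at replicas with a non-zero load weight; under that assumption a weight-0 host can lag without ever slowing the writer. The function names and the weight of 100 for db1062 are made up for illustration; the 300.01 second lag is the value from the db1028 alert earlier in the log. This is not MediaWiki's actual load balancer logic.

```python
from typing import Dict, List


def replicas_to_wait_for(weights: Dict[str, int]) -> List[str]:
    """Hypothetical rule: only replicas with weight > 0 are considered."""
    return [host for host, weight in weights.items() if weight > 0]


def max_observed_lag(weights: Dict[str, int], lag_seconds: Dict[str, float]) -> float:
    """Largest lag among the replicas the writer would actually wait for."""
    hosts = replicas_to_wait_for(weights)
    return max((lag_seconds[h] for h in hosts), default=0.0)


# Illustrative values: db1028 at weight 0 (from the log), db1062 at an assumed 100.
weights = {"db1028": 0, "db1062": 100}
lag_seconds = {"db1028": 300.01, "db1062": 0.0}
observed = max_observed_lag(weights, lag_seconds)
```

With these inputs the writer would see no lag at all, because db1028 is filtered out before its lag is read.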
[03:54:08] <kaldari>	 looks like the lag is leveling off at least (rather than continuing to climb)
[03:55:55] <kaldari>	 and now actually went down a little bit
[03:56:57] <kaldari>	 godog: thanks for pinging me, looks like we might be on the way back to normal now.
[03:57:29] <kaldari>	 I was looking at the slave lag when I first started running the script, but hadn't checked it since
[03:57:33] <icinga-wm>	 PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:58:22] <godog>	 kaldari: no worries, yeah it took about 30m to page I think
[03:58:55] <kaldari>	 hmm, now it went back up again
[03:59:42] <kaldari>	 godog: https://tendril.wikimedia.org/chart?hosts=db1028&vars=seconds_behind_master&mode=value
[04:01:26] <godog>	 indeed, I'm comparing it with e.g. db1062 in grafana
[04:01:44] <godog>	 the "write query stats" panel, which dropped for db1062 but not db1028
[04:04:10] <godog>	 looks like it might be still flushing to disk, writing a lot
[04:04:26] <kaldari>	 yeah, maybe it still has a lot to catch up on
[04:05:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:06:53] <godog>	 there we go
[04:07:03] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4739124 keys, up 38 days 19 hours - replication_delay is 647
[04:07:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:08:13] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2202.80 Read Requests/Sec=2093.40 Write Requests/Sec=8.50 KBytes Read/Sec=20758.40 KBytes_Written/Sec=55.20
[04:08:29] <godog>	 I'll wait a bit to make sure db1028 is ok
[04:09:02] <godog>	 kaldari: I'll file a task
[04:09:43] <icinga-wm>	 PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:41] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859216 (fgiunchedi)
[04:14:03] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4731976 keys, up 38 days 19 hours - replication_delay is 0
[04:17:18] <godog>	 kaldari: LGTM now, logging off, not yet sure about the root cause tho. I guess the script can be held off for now?
[04:18:13] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=24.38 Read Requests/Sec=0.40 Write Requests/Sec=4.10 KBytes Read/Sec=6.40 KBytes_Written/Sec=38.00
[04:25:33] <icinga-wm>	 RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[04:39:43] <icinga-wm>	 RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[04:48:43] <icinga-wm>	 PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:33] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 76 failures. Last run 2 minutes ago with 76 failures. Failed resources (up to 3 shown): Package[nagios-plugins-basic],Package[apt-transport-https],Package[tree],Package[ngrep]
[05:16:43] <icinga-wm>	 RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:24:27] <wikibugs_>	 Operations, Discovery, Maps (Tilerator): Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2859266 (Yurik)
[05:30:22] <wikibugs_>	 Operations, Interactive-Sprint, Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2859274 (Yurik)
[05:30:33] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:09:53] <icinga-wm>	 PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:12:43] <icinga-wm>	 PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:23:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:28:43] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:30:33] <icinga-wm>	 PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jq]
[06:38:53] <icinga-wm>	 RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:40:43] <icinga-wm>	 RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:54:23] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[06:56:43] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:33] <icinga-wm>	 PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:58:33] <icinga-wm>	 RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:12:43] <icinga-wm>	 PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:22:13] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:26:33] <icinga-wm>	 RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:31:15] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859340 (Marostegui) Thanks guys for taking care of this. A quick HW check reveals no issue with db1028, just to discard issues....
[07:39:28] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859343 (Peachey88)
[07:39:53] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:41:43] <icinga-wm>	 RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:42:34] <marostegui>	 !log Stop MySQL db2034 and db2048 for maintenance - T149553
[07:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:45] <stashbot>	 T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[08:00:59] <grrrit-wm>	 (PS1) Yuvipanda: labsdb: Fixup maintain-dbusers.sql [puppet] - https://gerrit.wikimedia.org/r/326076
[08:01:12] <grrrit-wm>	 (PS2) Yuvipanda: labsdb: Fixup maintain-dbusers.sql [puppet] - https://gerrit.wikimedia.org/r/326076
[08:01:15] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] labsdb: Fixup maintain-dbusers.sql [puppet] - https://gerrit.wikimedia.org/r/326076 (owner: Yuvipanda)
[08:02:50] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] labsdb: Fixup maintain-dbusers.sql [puppet] - https://gerrit.wikimedia.org/r/326076 (owner: Yuvipanda)
[08:08:53] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:21:20] <grrrit-wm>	 (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644)
[08:26:12] <grrrit-wm>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[08:26:47] <grrrit-wm>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[08:29:44] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T150644 (duration: 02m 10s)
[08:29:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:55] <stashbot>	 T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[08:39:12] <marostegui>	 !log Deploy alter table S5 wikidatawiki.revision on db1082 - T150644
[08:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:24] <stashbot>	 T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[08:41:04] <apergos>	 I'm going to be afk for a little while, I have an errand to run
[08:45:10] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 436249921 for key PRIMARY on query. Default database: commonswiki. Query: [snipped]2 Marostegui T152766
[09:07:41] <grrrit-wm>	 (PS1) Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418)
[09:10:40] <grrrit-wm>	 (PS2) Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418)
[09:11:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:12:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[09:16:53] <grrrit-wm>	 (PS3) Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418)
[09:18:46] <grrrit-wm>	 (CR) Volans: [C: 1] "LGTM" [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:25:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:26:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[09:29:32] <wikibugs>	 Operations, Labs: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767#2859419 (Volans)
[09:30:51] <grrrit-wm>	 (PS1) Volans: Add missing Hiera for labspuppetbackend_mysql_password [labs/private] - https://gerrit.wikimedia.org/r/326082 (https://phabricator.wikimedia.org/T152767)
[09:33:15] <grrrit-wm>	 (CR) Volans: [V: 2 C: 2] Add missing Hiera for labspuppetbackend_mysql_password [labs/private] - https://gerrit.wikimedia.org/r/326082 (https://phabricator.wikimedia.org/T152767) (owner: Volans)
[09:40:44] <grrrit-wm>	 (PS1) Elukey: Remove the role eventlogging from site.pp [puppet] - https://gerrit.wikimedia.org/r/326083 (https://phabricator.wikimedia.org/T152621)
[09:41:51] <grrrit-wm>	 (CR) Marostegui: "All look good: https://puppet-compiler.wmflabs.org/4845/" [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:43:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:43:51] <grrrit-wm>	 (CR) Marostegui: [C: 2] mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:44:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[09:44:49] <grrrit-wm>	 (CR) Elukey: [C: 2] Remove the role eventlogging from site.pp [puppet] - https://gerrit.wikimedia.org/r/326083 (https://phabricator.wikimedia.org/T152621) (owner: Elukey)
[09:47:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:33] <icinga-wm>	 RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[09:47:53] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:48:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[09:48:57] <wikibugs_>	 Operations, Labs, Patch-For-Review: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767#2859467 (Volans) p:High>Normal a:Volans>None I've quickly added the missing one, the old one `labspuppetbackend::mysql_password` is still there and `hieradata/...
[09:54:23] <grrrit-wm>	 (PS1) Elukey: Add the eventlogging admins back to eventlog1001 [puppet] - https://gerrit.wikimedia.org/r/326085 (https://phabricator.wikimedia.org/T152621)
[09:59:16] <grrrit-wm>	 (PS1) Marostegui: mariadb: Added gtid_domain_id to its own variable [puppet] - https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418)
[09:59:33] <grrrit-wm>	 (CR) Elukey: [C: 2] Add the eventlogging admins back to eventlog1001 [puppet] - https://gerrit.wikimedia.org/r/326085 (https://phabricator.wikimedia.org/T152621) (owner: Elukey)
[10:01:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:01:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:02:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[10:02:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:02:52] <grrrit-wm>	 (CR) Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/4848/" [puppet] - https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[10:03:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:03:03] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[10:03:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:04:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[10:04:03] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4721978 keys, up 39 days 1 hours - replication_delay is 0
[10:05:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[10:05:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:08:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:08:52] <grrrit-wm>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/326090
[10:09:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:09:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:10:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:10:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:10:07] <grrrit-wm>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/326090 (owner: Marostegui)
[10:10:44] <grrrit-wm>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/326090 (owner: Marostegui)
[10:11:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[10:11:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[10:11:46] <wikibugs>	 Operations, Puppet, Analytics-Kanban, Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2855084 (elukey) Last action left: removing unnecessary hiera data belonging to the eventlogging role (that is r...
[10:12:22] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 - T150644 (duration: 00m 46s)
[10:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:36] <stashbot>	 T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[10:13:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:15:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:15:53] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:18:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:19:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[10:22:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[10:25:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:26:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[10:52:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[10:53:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:54:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:54:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:55:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[10:55:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:55:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[10:56:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[10:59:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:59:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[11:01:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:02:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:03:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:04:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[11:04:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:05:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[11:05:55] <grrrit-wm>	 (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644)
[11:07:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[11:07:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:36] <grrrit-wm>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[11:08:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:08:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:08:12] <grrrit-wm>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[11:09:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[11:09:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[11:09:18] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T150644 (duration: 00m 46s)
[11:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:30] <stashbot>	 T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[11:10:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:10:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[11:13:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:56] <marostegui>	 !log Deploy alter table s5 wikidatawiki.revision on db1087 - T150644
[11:14:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:14:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[11:14:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:14:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:18:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:18:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:03] <icinga-wm>	 PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:03] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[11:20:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[11:20:03] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[11:20:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy
[11:20:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:05] <icinga-wm>	 PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:20:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[11:20:54] <icinga-wm>	 RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[11:21:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:21:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:21:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[11:21:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[11:22:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[11:22:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[11:23:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:23:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[11:23:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:23:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:23:14] <ema>	 !log upgrading cache_upload to varnish 4.1.4-1wm1
[11:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[11:24:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[11:26:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:26:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:28:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:28:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:28:03] <icinga-wm>	 PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:28:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[11:29:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[11:29:03] <icinga-wm>	 PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[11:29:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[11:29:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[11:30:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[11:30:03] <icinga-wm>	 RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[11:31:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[11:32:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[11:32:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[11:33:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[11:33:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[11:33:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:33:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:33:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:33:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[11:34:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[11:34:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:34:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[11:34:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:36:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[11:36:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:36:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:37:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:37:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:38:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:40:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:41:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:41:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:41:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:41:43] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:43:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[11:43:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:45:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:45:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:46:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:46:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[11:48:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:48:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:48:57] <logmsgbot>	 !log mobrovac@tin Starting deploy [changeprop/deploy@9a33bf4]: (no message)
[11:49:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[11:49:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:48] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@9a33bf4]: (no message) (duration: 00m 51s)
[11:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:50:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[11:50:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[11:50:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[11:50:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[11:50:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:52:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[11:52:53] <icinga-wm>	 PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:36] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859571 (jcrespo) @kaldari I do not see long-running script being referenced on https://wikitech.wikimedia.org/wiki/Deployments#...
[11:59:10] <wikibugs_>	 Operations: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2859574 (Volans)
[12:00:57] <mobrovac>	 !log scb stopping changeprop in eqiad to investigate outage
[12:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:41] <hashar>	 !log https://grafana-admin.wikimedia.org/dashboard/db/api-requests Made the template variable for MediaWiki.api.main.executeTiming. to be refreshed on dashboard load (that is for the pXX entries)
[12:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:35] <wikibugs>	 Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2859600 (Addshore) As the messages now seem to be appearing this can be rolled out to mw.org on monday.
[12:03:43] <icinga-wm>	 PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:03:53] <icinga-wm>	 PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:04:13] <icinga-wm>	 PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:04:13] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[12:04:13] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[12:04:13] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[12:04:33] <icinga-wm>	 PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:04:33] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage
[12:08:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage
[12:08:01] <icinga-wm>	 ACKNOWLEDGEMENT - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage
[12:08:09] <hoo>	 bblack: https://phabricator.wikimedia.org/T142944#2782663 Would this suffice (for a first trial at least)?
[12:16:39] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1189.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2'])
[12:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:27] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2859620 (Volans)
[12:21:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[12:22:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 74312 bytes in 0.125 second response time
[12:22:53] <icinga-wm>	 RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[12:23:07] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1189.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2'])
[12:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:27:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74313 bytes in 0.138 second response time
[12:27:53] <icinga-wm>	 RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.031 second response time
[12:27:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[12:31:53] <icinga-wm>	 PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:44:13] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[12:44:23] <mobrovac>	 !log scb re-enabled changeprop
[12:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:33] <icinga-wm>	 RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational
[12:44:33] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy
[12:44:43] <icinga-wm>	 RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational
[12:44:54] <icinga-wm>	 RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational
[12:45:13] <icinga-wm>	 RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational
[12:45:13] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy
[12:45:13] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy
[12:50:25] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1289.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2'])
[12:50:29] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1290.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2'])
[12:50:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:44] <grrrit-wm>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/326109
[12:58:08] <grrrit-wm>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/326109 (owner: Marostegui)
[12:58:43] <grrrit-wm>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/326109 (owner: Marostegui)
[12:59:46] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T150644 (duration: 00m 45s)
[12:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:58] <stashbot>	 T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[13:00:53] <icinga-wm>	 RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[13:02:01] <grrrit-wm>	 (PS1) Yuvipanda: labs: More fixups to labsdbaccounts db [puppet] - https://gerrit.wikimedia.org/r/326111
[13:20:34] <grrrit-wm>	 (PS2) Yuvipanda: labs: More fixups to labsdbaccounts db [puppet] - https://gerrit.wikimedia.org/r/326111
[13:20:42] <grrrit-wm>	 (CR) Yuvipanda: [V: 2 C: 2] labs: More fixups to labsdbaccounts db [puppet] - https://gerrit.wikimedia.org/r/326111 (owner: Yuvipanda)
[13:27:34] <icinga-wm>	 PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:29:32] <grrrit-wm>	 (PS1) Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - https://gerrit.wikimedia.org/r/326113
[13:29:49] <wikibugs>	 Operations, MediaWiki-API, Monitoring, Parsoid, and 2 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859704 (hashar)
[13:30:53] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[13:31:13] <grrrit-wm>	 (PS1) Volans: Raid handler: force check_nrpe over IPv4 [puppet] - https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774)
[13:31:46] <volans>	 marostegui: is that you ^^^ (puppetmaster1001)
[13:31:48] <grrrit-wm>	 (PS2) Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - https://gerrit.wikimedia.org/r/326113
[13:32:07] <marostegui>	 volans: don't think so
[13:32:14] <marostegui>	 volans: let me double check
[13:32:32] <volans>	 the missing puppet merge
[13:32:52] <grrrit-wm>	 (PS3) Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - https://gerrit.wikimedia.org/r/326113
[13:33:09] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid'])
[13:33:10] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid'])
[13:33:11] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid'])
[13:33:12] <marostegui>	 volans: nope, my change has not been submitted: https://gerrit.wikimedia.org/r/#/c/326086/
[13:33:12] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[13:33:13] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium'])
[13:33:14] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver'])
[13:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:53] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores'])
[13:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:56] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender'])
[13:33:59] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams'])
[13:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:50] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid'])
[13:34:51] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid'])
[13:34:54] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[13:35:00] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid'])
[13:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:05] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[13:35:07] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium'])
[13:35:08] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver'])
[13:35:10] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores'])
[13:35:11] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender'])
[13:35:12] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams'])
[13:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:18] <akosiaris>	 !log depool fully scb1003, scb1004 T150882
[13:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:38] <grrrit-wm>	 (PS4) Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - https://gerrit.wikimedia.org/r/326113
[13:35:46] <grrrit-wm>	 (CR) Yuvipanda: [V: 2 C: 2] labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - https://gerrit.wikimedia.org/r/326113 (owner: Yuvipanda)
[13:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:41] <stashbot>	 T150882: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882
[13:39:13] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[13:39:13] <icinga-wm>	 PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:41:06] <akosiaris>	 !log  poweroff scb1003, scb1004
[13:41:13] <wikibugs>	 Operations, ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859725 (akosiaris) Depooled and shutdown scb1003, scb1004. Scheduled downtime in icinga as well
[13:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:34] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[13:41:43] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:42:17] <wikibugs_>	 Operations, ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859726 (akosiaris) @Cmjohnson The servers are ready for their thermal paste treatment.
[13:42:31] <YuviPanda>	 uh oh
[13:42:33] <YuviPanda>	 I'll look
[13:43:28] <grrrit-wm>	 (PS2) Volans: Raid handler: force check_nrpe over IPv4 [puppet] - https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774)
[13:45:50] <grrrit-wm>	 (CR) Volans: [C: 2] Raid handler: force check_nrpe over IPv4 [puppet] - https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774) (owner: Volans)
[13:47:13] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - create-dbusers is active
[13:47:13] <icinga-wm>	 RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[13:47:27] <jynus>	 !log disable puppet on db1047, db1046 and dbstore1002 in preparation for restarts T152188
[13:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:39] <stashbot>	 T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[13:47:58] <chasemp>	 YuviPanda: I /think/ it's died once before for a similar reason: ldap restarted in the middle of the long ldap query via the mem leak cron
[13:48:18] <chasemp>	 as-is iirc it connects to both ldap servers and round robins the queries to get all users and handles a disconnect poorly
[13:48:37] <YuviPanda>	 chasemp: nope, this is just me. my last patch broke it
[13:48:44] <chasemp>	 ah :)
[13:48:46] <YuviPanda>	 chasemp: new patch coming up
[13:48:53] <grrrit-wm>	 (PS1) Yuvipanda: labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118
[13:49:03] <YuviPanda>	 chasemp: we should probably not be running it in two places tho
[13:49:41] <chasemp>	 well that's a question I imagine, of whether it will be idempotent and it's not a big deal or whether it will race condition against itself etc
[13:50:48] <YuviPanda>	 chasemp: idempotent against *two* simultaneous processes is going to be hard - you've to count every single line of code you write as if there could be a race somewhere else. And idk if it buys us anything at all?
[13:51:10] <YuviPanda>	 chasemp: idempotent != concurrent, I ugess
[13:51:14] <chasemp>	 I think the current implementation it's god the possibility of bad news but I would like to solve it in a way that's not $active_host switches in puppet if possible
[13:51:14] <YuviPanda>	 *guess
[13:51:22] <chasemp>	 s/god/got heh
[13:51:49] <YuviPanda>	 chasemp: what do you have in mind?
[13:52:51] <chasemp>	 right now the mechanism for who is active is controlled via the nfs-manage script and ideally that's the canonical interface, so possibly a marker that derives from that which could be the cluster IP that is only avail on the active or a more explicit drop file
[13:53:01] <chasemp>	 the idea being that when you down or up w/ that it's authoritative
[13:53:05] <YuviPanda>	 chasemp: that sounds good to me
[13:53:12] <YuviPanda>	 chasemp: can you write that up on the ticket?
[13:53:13] <chasemp>	 esp considering time constraints on NFS failure being graceful
[13:53:15] <chasemp>	 yessum
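A minimal sketch of the "explicit drop file" variant chasemp describes above, where a job checks for a marker before doing any work so that only the active host acts. The marker path and the names `is_active_host` and `run_if_active` are hypothetical and only for illustration; the log does not show how nfs-manage would actually write or remove such a marker.

```python
import os
import tempfile


def is_active_host(marker_path):
    """Return True when the drop file marking this host as active exists."""
    return os.path.exists(marker_path)


def run_if_active(marker_path, job):
    """Run job() only on the active host; return its result, or None if skipped."""
    if not is_active_host(marker_path):
        return None  # standby host: do nothing
    return job()


if __name__ == "__main__":
    # Hypothetical marker location; a real failover script would own this file.
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "nfs-active")
        print(run_if_active(marker, lambda: "created users"))  # standby: skipped
        open(marker, "w").close()  # simulate the failover script marking us active
        print(run_if_active(marker, lambda: "created users"))  # active: job runs
```

The point of the design is that the failover tooling stays the single authority: the job never decides for itself which host is active, it only reads the marker.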
[13:53:32] <YuviPanda>	 I've never seen the nfs-manage script, so I might need some help there. but that's going to be a while anyway
[13:53:38] <chasemp>	 YuviPanda: that main goal one?
[13:54:02] <chasemp>	 sure it's not at all complex, it's basically a procedural on what to bring up for the full stack
[13:54:05] <YuviPanda>	 chasemp: I mean, not today :D I will probably get the script done today, but not deploy it
[13:54:21] <chasemp>	 cool
[13:54:22] <YuviPanda>	 chasemp: there's already a 'canonical store' of all the user accts on m5 :D just user creation left now
[13:54:52] <YuviPanda>	 I haven't figured out if I want this to run on a cron or as a daemon tho
[13:54:53] <chasemp>	 awesome, how did farming the existing creds work out?
[13:55:08] <chasemp>	 I went down that road and came to the conclusion we stink at monitoring crons
[13:55:29] <YuviPanda>	 chasemp: took me a while to sort out the edge cases. everything is 'stable' now
[13:55:33] <icinga-wm>	 RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[13:55:40] <chasemp>	 that's why I've grudgingly accepted the nfs-exportd etc as a 'timer' 
[13:56:00] <YuviPanda>	 yeah, that's why I've been writing them to be in a looping process
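A rough sketch of the "looping process" approach mentioned above, as opposed to a cron entry. It assumes the process runs under a supervisor whose unit state is monitored (the "Expecting active but unit create-dbusers is failed" alerts earlier in this log suggest that kind of check). The function name `run_forever` and its parameters are hypothetical, not taken from the actual scripts.

```python
import time


def run_forever(task, interval_seconds, sleep=time.sleep, max_iterations=None):
    """Call task() repeatedly, pausing interval_seconds between calls.

    Exceptions are deliberately not caught: the process exits non-zero,
    the supervising unit shows as failed, and a check on unit state can alert.
    max_iterations exists only so the loop can be exercised in tests.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        task()
        iterations += 1
        sleep(interval_seconds)
    return iterations


if __name__ == "__main__":
    # Three quick iterations for demonstration; a real service would omit max_iterations.
    run_forever(lambda: print("syncing accounts"), interval_seconds=0, max_iterations=3)
```

Compared with cron, a failure here is visible as process state rather than as a missed run that nobody notices.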
[13:56:12] <wikibugs_>	 Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2828786 (Vriullop) I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile...
[13:56:21] <grrrit-wm>	 (PS2) Yuvipanda: labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118
[13:56:29] <grrrit-wm>	 (CR) Yuvipanda: [V: 2 C: 2] labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118 (owner: Yuvipanda)
[13:57:13] <ema>	 !log upgrading cache_text to varnish 4.1.4-1wm1
[13:57:16] <chasemp>	 YuviPanda: cool man, good looking on the possibility of race condition and general issue there, I'll drop a note on the task today 
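A toy illustration of the "idempotent != concurrent" point from the exchange above. `AccountStore`, `ensure_account`, and `interleaved_ensure` are hypothetical stand-ins, not the real create-dbusers code; the second function replays by hand the unlucky ordering in which two processes both check before either one creates.

```python
class AccountStore:
    """Toy in-memory stand-in for wherever accounts get created."""

    def __init__(self):
        self.created = []

    def exists(self, user):
        return user in self.created

    def create(self, user):
        self.created.append(user)


def ensure_account(store, user):
    """Idempotent when run by one process at a time: check, then act."""
    if not store.exists(user):
        store.create(user)


def interleaved_ensure(store, user):
    """Replay the unlucky ordering of two processes: both check, then both act."""
    a_missing = not store.exists(user)  # process A checks
    b_missing = not store.exists(user)  # process B checks before A creates
    if a_missing:
        store.create(user)
    if b_missing:
        store.create(user)


if __name__ == "__main__":
    sequential = AccountStore()
    ensure_account(sequential, "tool-a")
    ensure_account(sequential, "tool-a")
    print(len(sequential.created))  # one account after two sequential runs

    raced = AccountStore()
    interleaved_ensure(raced, "tool-a")
    print(len(raced.created))  # two creations when the checks interleave
```

Running the same check-then-act step twice in sequence leaves one account, while the interleaved ordering creates it twice, which is why running the job on a single active host sidesteps the problem.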
[13:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:43] <icinga-wm>	 PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00
[13:59:45] <grrrit-wm>	 (PS1) Volans: Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774)
[14:00:29] <joal>	 Hi ops-team, the eventLogging error is me stopping this consumer for a DB restart
[14:00:31] <grrrit-wm>	 (PS2) Volans: Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774)
[14:00:40] <joal>	 I'll try to acknowledge in icinga
[14:01:15] <icinga-wm>	 ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 Jcrespo stopped for mysql restart
[14:01:48] <joal>	 Thanks jynus 
[14:01:51] <grrrit-wm>	 (CR) Volans: [C: 2] Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774) (owner: Volans)
[14:03:53] <jynus>	 !log setting db1046, db1047, dbstore1002 in read-only mode/stopping replication 
[14:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:16] <jynus>	 !log restarting db1046 T152188
[14:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:28] <stashbot>	 T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[14:08:07] <wikibugs_>	 Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859757 (mobrovac)
[14:08:43] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[14:09:33] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - create-dbusers is active
[14:11:22] <grrrit-wm>	 (PS1) Jcrespo: analytics-mariadb: Enable new certificates on eventlogging servers [puppet] - https://gerrit.wikimedia.org/r/326122 (https://phabricator.wikimedia.org/T152188)
[14:13:16] <grrrit-wm>	 (CR) Jcrespo: [C: 2] analytics-mariadb: Enable new certificates on eventlogging servers [puppet] - https://gerrit.wikimedia.org/r/326122 (https://phabricator.wikimedia.org/T152188) (owner: Jcrespo)
[14:14:12] <wikibugs>	 Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2859773 (mobrovac) >>! In T151702#2859748, @Vriullop wrote: > I try to expand it from the wiki part. A page with this template transclusion was rendered in...
[14:20:22] <grrrit-wm>	 (PS1) Jcrespo: eventlogging-mariadb: Add new TLS certs to eventlogging severs [puppet] - https://gerrit.wikimedia.org/r/326123 (https://phabricator.wikimedia.org/T152188)
[14:20:53] <grrrit-wm>	 (CR) Jcrespo: [C: 2] eventlogging-mariadb: Add new TLS certs to eventlogging severs [puppet] - https://gerrit.wikimedia.org/r/326123 (https://phabricator.wikimedia.org/T152188) (owner: Jcrespo)
[14:27:21] <grrrit-wm>	 (PS1) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766)
[14:28:39] <grrrit-wm>	 (PS2) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766)
[14:32:13] <icinga-wm>	 PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:32:43] <icinga-wm>	 RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning.
[14:33:40] <grrrit-wm>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4849/" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[14:35:13] <grrrit-wm>	 (PS3) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766)
[14:35:50] <grrrit-wm>	 (CR) Marostegui: "The new script can be tested at dbstore1001:/home/marostegui" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[14:39:23] <jynus>	 !log restarting db1047 T152188
[14:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:36] <stashbot>	 T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[14:39:44] <grrrit-wm>	 (PS1) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125
[14:40:27] <grrrit-wm>	 (PS2) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125
[14:40:32] <wikibugs_>	 Operations, MediaWiki-API, Monitoring, Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859816 (Anomie)
[14:41:22] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 (owner: Yuvipanda)
[14:42:22] <grrrit-wm>	 (PS3) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125
[14:42:45] <grrrit-wm>	 (PS4) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125
[14:44:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:46:19] <grrrit-wm>	 (CR) Jcrespo: "On a maintenance window, will look at it later, this need careful review because otherwise we could leak private data." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[14:46:34] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 (owner: Yuvipanda)
[14:48:21] <grrrit-wm>	 (PS4) Marostegui: check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766)
[14:48:23] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:48:52] <grrrit-wm>	 (CR) Marostegui: "Sounds good!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[14:49:39] <jynus>	 !log restarting dbstore1002 T152188
[14:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:51] <stashbot>	 T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[14:50:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180047.91 seconds
[14:51:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:52:13] <jynus>	 ^marostegui, dbstore1001 is that you?
[14:52:24] <marostegui>	 no, I haven't done anything
[14:52:36] <jynus>	 ah, I know what it is
[14:52:39] <marostegui>	 I think I ack'ed in the morning the replication break one but maybe not the lag one
[14:52:46] <jynus>	 the lag alert is very large there
[14:52:59] <jynus>	 so it is a 24-hour delayed alert
[14:53:01] <jynus>	 :-)
[14:53:28] <jynus>	 large here means, the lag offset is allowed to go very far
[14:54:44] <jynus>	 will handle that in a second
[14:54:52] <jynus>	 once I fix analytics
[14:57:23] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:57:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:58:14] <wikibugs>	 Operations, MediaWiki-extensions-UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2859855 (mehtab.ahmed) @Aklapper : waiting for reply.
[15:00:53] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:01:13] <icinga-wm>	 RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[15:04:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:04:48] <wikibugs>	 Operations: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2859863 (Volans) Related to T142430
[15:12:12] <wikibugs>	 Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859881 (GWicke)
[15:13:35] <grrrit-wm>	 (PS1) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132
[15:14:22] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema)
[15:15:45] <wikibugs>	 Operations, ArticlePlaceholder, Traffic, Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2859905 (Lydia_Pintscher) Hey :)  We'd really like to move forward with making the ArticlePlaceholder more useful. It not showing...
[15:16:12] <wikibugs_>	 Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2859907 (Deskana)
[15:16:15] <wikibugs>	 Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2859906 (Deskana) Open>Resolved
[15:16:18] <wikibugs_>	 Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859908 (mobrovac)
[15:16:23] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:16:51] <grrrit-wm>	 (PS2) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132
[15:17:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:17:53] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema)
[15:20:03] <grrrit-wm>	 (PS3) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132
[15:21:37] <grrrit-wm>	 (PS1) Hoo man: Load the property order from Wikidata per default [mediawiki-config] - https://gerrit.wikimedia.org/r/326133 (https://phabricator.wikimedia.org/T149540)
[15:21:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:24:06] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 182027.99 seconds Jcrespo https://phabricator.wikimedia.org/T152766
[15:25:26] <wikibugs>	 Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859881 (EBernhardson) Poking through the 'Visualize' tab, kibana 4 reports having both standard and date based histogr...
[15:25:38] <Reedy>	 !log reedy@terbium$ time mwscript refreshImageMetadata.php --wiki=testwiki --force --mediatype=BITMAP | tee /tmp/refreshtestwikiimages.log
[15:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:53] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[15:32:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:32:49] <wikibugs_>	 Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859969 (GWicke) The workflow we were using is more along the lines of:  1) Query some subset of log entries, typically...
[15:34:43] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[15:38:43] <icinga-wm>	 PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:39:33] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[15:39:34] <grrrit-wm>	 (CR) Jcrespo: ".*$ ?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[15:40:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857.00 seconds
[15:40:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 692.07 seconds
[15:41:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1062.75 seconds
[15:43:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 132.99 seconds
[15:43:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 207.08 seconds
[15:44:36] <jynus>	 the downtime expired
[15:44:45] <jynus>	 just in time to get solved
[15:46:03] <jynus>	 db1047 should be solved in a minute
[15:47:13] <icinga-wm>	 PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:47:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:48:31] <grrrit-wm>	 (CR) Ema: [C: 2] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema)
[15:48:58] <grrrit-wm>	 (PS5) Marostegui: check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766)
[15:49:27] <grrrit-wm>	 (CR) Marostegui: check_mariadb.pl: Fix small display issue (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[15:55:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:56:23] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[15:56:53] <icinga-wm>	 PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:57:23] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:57:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:59:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 1.81 seconds
[16:00:18] <grrrit-wm>	 (CR) Jcrespo: [C: 1] check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[16:00:23] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:02:43] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[16:03:23] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:05:23] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:05:43] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:05:43] <icinga-wm>	 RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[16:07:23] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[16:16:13] <icinga-wm>	 RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:21:32] <cmjohnson1>	 !log powering off scb1003 for thermal paste replacement 
[16:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:53] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:24:28] <wikibugs>	 Operations, MediaWiki-API, Parsoid, RESTBase, and 5 others: Support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860017 (mark) I just found an HHVM issue which suggests that max_execution_time is ignored in FCGI mode: https://github.com/facebook/hhvm/is...
[16:25:53] <icinga-wm>	 RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[16:33:22] <grrrit-wm>	 (CR) Jcrespo: [C: 1] "My recommendation would be to redeploy on Monday. Last thing I want is to deploy a friday afternoon a new alerting logic and create a page s" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui)
[16:35:40] <marostegui>	 ^ jynus yes, no way I am deploying that now :)
[16:36:00] <grrrit-wm>	 (PS1) Mark Bergsma: Set hhvm.server.request_timeout_seconds to 60s [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192)
[16:42:46] <wikibugs_>	 Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2860051 (Cmjohnson) Both servers have had their thermal paste removed and replaced.
[16:46:15] <wikibugs>	 Operations, Prometheus-metrics-monitoring: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#2804471 (faidon) I don't think we should mess with the system's PAM config for this -- that's going to be a dangerous change, especially in the long run.
[16:47:01] <wikibugs_>	 Operations, ops-eqiad, Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2860054 (Cmjohnson)
[16:51:53] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[17:03:13] <icinga-wm>	 PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:07:13] <icinga-wm>	 PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed
[17:07:23] <icinga-wm>	 PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:13:33] <icinga-wm>	 PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:22:03] <grrrit-wm>	 (CR) GWicke: [C: 1] Set hhvm.server.request_timeout_seconds to 60s [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma)
[17:24:23] <icinga-wm>	 RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[17:25:13] <icinga-wm>	 RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active
[17:25:45] <grrrit-wm>	 (CR) Anomie: "Checking api.log, I see a fair number of requests that would be affected by this if it goes by wall time. Half them seem to be action=pars" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma)
[17:26:03] <icinga-wm>	 PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:27:53] <grrrit-wm>	 (CR) Mark Bergsma: "> Checking api.log, I see a fair number of requests that would be" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma)
[17:32:13] <icinga-wm>	 RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:37:23] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[17:41:33] <icinga-wm>	 RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[17:53:43] <icinga-wm>	 PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:55:03] <icinga-wm>	 RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[17:59:13] <icinga-wm>	 PROBLEM - carbon-cache@b service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed
[17:59:23] <icinga-wm>	 PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:00:53] <grrrit-wm>	 (CR) Volans: "In general LGTM, but I have zero knowledge of gdnsd stats ;)" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: Filippo Giunchedi)
[18:01:58] <grrrit-wm>	 (CR) GWicke: [C: 1] "I really think we need to prioritize system stability and availability over allowing very expensive requests to consume server side resour" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma)
[18:03:59] <grrrit-wm>	 (PS1) Volans: Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430)
[18:04:45] <grrrit-wm>	 (PS2) Volans: Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430)
[18:07:33] <icinga-wm>	 PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:18] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans)
[18:11:12] <wikibugs_>	 Operations, MediaWiki-API, Parsoid, RESTBase, and 6 others: Support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860191 (Anomie) Tried my code from T97192#1237258 with `hhvm.server.request_timeout_seconds`. Even though changing it in `/etc/hhvm/fcgi.ini...
[18:11:13] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:14:14] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[18:18:53] <grrrit-wm>	 (CR) Volans: "Puppet compiler results: https://puppet-compiler.wmflabs.org/4851/" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans)
[18:19:47] <grrrit-wm>	 (CR) Volans: [C: 2] Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans)
[18:21:43] <icinga-wm>	 RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[18:22:39] <wikibugs_>	 Operations, Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2860212 (fgiunchedi)
[18:22:43] <grrrit-wm>	 (PS3) Anomie: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - https://gerrit.wikimedia.org/r/313035
[18:22:54] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:23:37] <volans>	 yay, finally stat1002 works
[18:24:13] <icinga-wm>	 RECOVERY - carbon-cache@b service on graphite1003 is OK: OK - carbon-cache@b is active
[18:24:23] <icinga-wm>	 RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[18:26:15] <wikibugs_>	 Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2860228 (Volans) given @Ottomata was out today we decided to revert the last change that added that package to unblock Puppet on `stat1002`. On all the hosts in which it wa...
[18:30:05] <wikibugs>	 Operations, MediaWiki-API, Parsoid, RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860252 (GWicke)
[18:34:55] <grrrit-wm>	 (Draft1) Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640)
[18:34:57] <grrrit-wm>	 (Draft2) Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640)
[18:35:33] <icinga-wm>	 RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:36:06] <paladox>	 ostriches ^^, that is a prep patch in case we decide to do that, but it will fix one problem; we now just need to pull stable-2.13 and cherry-pick the patches that fix external_id
[18:36:07] <paladox>	 https://gerrit-review.googlesource.com/#/c/92830/
[18:36:32] <MarcoA>	 can I have someone run fixNamespaceDupes to check a li'l thing?
[18:38:30] <MatmaRex>	 running scripts with 'mwscript' will execute them with PHP, and not HHVM, correct?
[18:38:50] <MatmaRex>	 bd808: you probably know ^
[18:39:47] <ostriches>	 paladox: That would go after the fix + reindex. Then we'll try the patch
[18:39:55] <paladox>	 Yep
[18:40:06] <paladox>	 I put a note in there that it needs your +1
[18:42:23] <paladox>	 ostriches are you pulling from upstream (stable-2.13)?
[18:42:29] <ostriches>	 Yep
[18:42:36] <ostriches>	 I'm building stable-2.13 right now
[18:42:41] <ostriches>	 The changes got merged
[18:42:42] <paladox>	 Ok
[18:42:43] <paladox>	 thanks
[18:42:59] <paladox>	 are you going to cherry-pick https://gerrit-review.googlesource.com/#/c/92830/
[18:43:00] <paladox>	 ?
[18:43:23] <ostriches>	 I don't *think* it'll be necessary.
[18:43:29] <ostriches>	 We could I suppose
[18:43:35] <ostriches>	 We already got these 5:
[18:43:37] <ostriches>	 082f939324 Fix eviction order when linking new external ids
[18:43:38] <ostriches>	 8f83bbbb27 AccountManager#create: Do not overwrite external ID of other account
[18:43:38] <ostriches>	 ca547ff308 Add REST endpoint to reindex a single account
[18:43:38] <ostriches>	 5862be6274 Revert "AccountManager: Check that ext ID belongs to account before delete"
[18:43:38] <ostriches>	 352be569f9 AccountManager: Check that ext ID belongs to account before delete
[18:43:46] <grrrit-wm>	 (PS3) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949
[18:44:01] <paladox>	 Oh
[18:46:14] <ostriches>	 I think the other fixes make 92830 unnecessary 
[18:46:44] <paladox>	 oh
[18:46:52] <ostriches>	 That being said, it couldn't hurt.
[18:47:04] <paladox>	 Yep
[18:47:30] <ostriches>	 rebuilding, shouldn't take long
[18:47:36] <paladox>	 Yep :)
[18:47:39] <grrrit-wm>	 (PS1) EBernhardson: Don't retry InitImageDataJob's [puppet] - https://gerrit.wikimedia.org/r/326151
[18:47:55] <ostriches>	 Please tell me why we have to rebuild Documentation/licenses every time, even when it doesn't change :p
[18:47:57] <ostriches>	 So slowwwww
[18:48:14] <ostriches>	 Oh well
[18:48:45] <paladox>	 Yeh
[18:48:46] <paladox>	 so slow
[18:49:06] <MarcoA>	 and slower as it approaches the end
[18:49:45] <paladox>	 LOL, yeh reindex may be quicker this time or not
[18:49:51] <ostriches>	 It should be
[18:49:56] <ostriches>	 I think I can *just* reindex accounts
[18:50:01] <ostriches>	 Have to check
[18:50:01] <paladox>	 yep
[18:50:03] <grrrit-wm>	 (PS1) Rush: WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153
[18:50:06] <ostriches>	 Then yeah, should be fast
[18:50:11] <ostriches>	 It's changes that are slow, not accounts
[18:50:31] <paladox>	 Yep
[18:50:44] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 (owner: Rush)
[18:52:12] <wikibugs>	 Operations, MediaWiki-extensions-UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2860328 (Aklapper) Feel free to propose in a separate task. :)
[18:52:13] <icinga-wm>	 PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:53:22] <grrrit-wm>	 (PS2) Rush: WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153
[18:54:11] <MatmaRex>	 bd808: ok, i now see that mwscript uses php5 and not hhvm. why? D:
[18:54:21] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 (owner: Rush)
[18:54:56] <MatmaRex>	 i see. 8f8e7dbdd834066504e59edfc4881bb98f76072a D:
[18:55:12] <MatmaRex>	 why can nothing ever just work :(
[18:55:34] <grrrit-wm>	 (PS1) Chad: gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154
[18:55:54] <grrrit-wm>	 (PS1) Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - https://gerrit.wikimedia.org/r/326155
[18:56:32] <bd808>	 MatmaRex: mwscript *could* use hhvm, but it would be slower
[18:56:42] <grrrit-wm>	 (CR) Chad: "Dunno if that's the right way to format "minor" bullet points. Google says yes I'm not sure if "minor" is the right way to do a list as a " [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (owner: Chad)
[18:56:47] <MatmaRex>	 bd808: context is https://phabricator.wikimedia.org/T32961#2860324
[18:56:52] <bd808>	 how much slower is an "it depends" question
[18:56:53] <grrrit-wm>	 (CR) Paladox: "Probably want to add Bug: T152640" [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (owner: Chad)
[18:56:55] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Create framework to transfer files over the LAN [puppet] - https://gerrit.wikimedia.org/r/326155 (owner: Jcrespo)
[18:57:01] <MatmaRex>	 bd808: php is buggy for my use case, and hhvm probably isn't
[18:57:32] <MatmaRex>	 bd808: isn't hhvm supposed to be better for long-running maintenance scripts? :/
[18:57:42] <bd808>	 MatmaRex: care to guess how many times it would go the other direction? ;)
[18:58:19] <bd808>	 we could probably add a flag to mwscript that made it possible to use hhvm
[18:58:21] <MatmaRex>	 bd808: oh, i'm sure plenty, just a couple weeks ago i was porting fixes from php to hhvm
[18:58:22] <MarcoA>	 bd808: would you run namespaceDupes.php maintenance script for me for https://phabricator.wikimedia.org/T152793 ? 
[18:58:32] <MatmaRex>	 but just today, php happened to be worse
[18:58:34] <bd808>	 or you can just as easily craft the correct command to do so yourself
[18:58:39] <MatmaRex>	 (and only because we're running an old version)
[18:59:37] <bd808>	 MarcoA: I could probably do that. cawikiquote?
[18:59:46] <MarcoA>	 bd808: yep
[18:59:56] <MatmaRex>	 bd808: hmm, if you're here, wanna run a maintenance script on testwiki for me afterwards? ;)
[19:00:19] <MarcoA>	 mwscript namespaceDupes.php --cawikiquote --fix ?
[19:00:29] <MarcoA>	 not sure about the syntax
[19:00:51] <MarcoA>	 should be on terbium
[19:01:46] <Reedy>	 --wiki=cawikiquote
[19:01:52] <bd808>	 !log Ran `mwscript namespaceDupes.php cawikiquote --fix` for T152793
[19:01:52] <Reedy>	 or just no --wiki=
[19:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:05] <stashbot>	 T152793: Fix namespaces in ca.wiktionary - https://phabricator.wikimedia.org/T152793
[19:02:42] <bd808>	 MarcoA: can you check to see if that fixed your problem?
[19:02:46] <MarcoA>	 sure
[19:02:58] <bd808>	 MatmaRex: what's your honeydo?
[19:02:59] <grrrit-wm>	 (PS2) Filippo Giunchedi: gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (https://phabricator.wikimedia.org/T152640) (owner: Chad)
[19:03:08] <grrrit-wm>	 (CR) Filippo Giunchedi: [V: 2 C: 2] gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (https://phabricator.wikimedia.org/T152640) (owner: Chad)
[19:03:25] <MarcoA>	 bd808: doubleredirects now lists that page as struck out, so it should be fixed
[19:03:29] <MarcoA>	 thanks a bunch
[19:03:32] <bd808>	 yw
[19:04:35] <MatmaRex>	 bd808: ok, nevermind, i'm getting Reedy to do it. thanks :)
[19:04:53] <Reedy>	 Because buying beers in GBP is cheaper than USD? ;)
[19:04:58] <MatmaRex>	 haha
[19:06:53] <MarcoA>	 I think we've got a problem
[19:07:12] <MarcoA>	 if you type now VD:AJUDA, it redirects you to https://ca.wikiquote.org/wiki/Especial:GoToInterwiki/vd:AJUDA
[19:07:28] <MarcoA>	 hmm
[19:08:23] <Reedy>	 bd808: Didn't we make HHVM CLI not use JIT stuff?
[19:09:09] <grrrit-wm>	 (PS2) RobH: adding new shell users arnad & jgonsior [puppet] - https://gerrit.wikimedia.org/r/325868 (https://phabricator.wikimedia.org/T152023)
[19:09:17] <cmjohnson1>	 !log powering down db1073 to apply thermal paste https://phabricator.wikimedia.org/T149728
[19:09:28] <grrrit-wm>	 (CR) RobH: [C: 2] adding new shell users arnad & jgonsior [puppet] - https://gerrit.wikimedia.org/r/325868 (https://phabricator.wikimedia.org/T152023) (owner: RobH)
[19:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:32] <MatmaRex>	 Reedy: bd808: fwiw, mwscript was switched to use php5 in this commit 10 months ago: https://gerrit.wikimedia.org/r/#/c/267816/
[19:09:35] <bd808>	 Reedy: ... don't remember. We did something to make it not pre-fork an exec thread pool
[19:10:07] <bd808>	 yeah... I think that bug should be fixed now
[19:10:15] <bd808>	 that was the pre-fork pool thing
[19:12:14] <wikibugs>	 Operations, Ops-Access-Requests, Research-and-Data, Patch-For-Review: adrian bielefeldt & julius gonsior shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T152023#2860391 (RobH) Open>Resolved No objections were noted, so this has been merged live.  It will ta...
[19:12:35] <bd808>	 MarcoA: so did namespaceDupes.php mess something up?
[19:12:51] <MarcoA>	 bd808: not sure what might have happened
[19:13:06] <MarcoA>	 Dereckson: you there?
[19:13:45] <bd808>	 MarcoA: I'm going to step away for food, but poke Reedy if you find something that needs help while I'm gone.
[19:13:54] <Dereckson>	 MarcoA: yup
[19:13:59] <Dereckson>	 How can I help you?
[19:14:01] <MarcoA>	 bd808: thanks
[19:14:20] <MarcoA>	 Dereckson: we've just run namespaceDupes at cawikiquote and it seems something went wrong
[19:14:38] * Dereckson checks
[19:14:43] <Dereckson>	 you've the output on a pastebin?
[19:15:09] <MarcoA>	 it's on Phab
[19:15:44] <MarcoA>	 T152793
[19:15:45] <stashbot>	 T152793: Fix namespaces in ca.wiktionary - https://phabricator.wikimedia.org/T152793
[19:15:49] <grrrit-wm>	 (PS2) RobH: new shell user piccardi [puppet] - https://gerrit.wikimedia.org/r/325869 (https://phabricator.wikimedia.org/T151969)
[19:16:21] <godog>	 !log upload gerrit_2.13.3+git1-wmf.1 to carbon - T152640
[19:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:35] <stashbot>	 T152640: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640
[19:16:38] <godog>	 ostriches: ^ it is taking forever to clone gerrit locally, I'll send the patch later
[19:16:41] <MarcoA>	 it seems there's no NamespaceAliases defined but that shouldn't be an issue - it gave Title error before running the script
[19:16:46] <grrrit-wm>	 (CR) RobH: [C: 2] new shell user piccardi [puppet] - https://gerrit.wikimedia.org/r/325869 (https://phabricator.wikimedia.org/T151969) (owner: RobH)
[19:17:11] <ostriches>	 godog: Okie dokie, I'm seeing the new package after an update on cobalt
[19:17:32] <MarcoA>	 vd: is an interwiki defined in the interwiki map :S
[19:17:50] <Dereckson>	 MarcoA: okay, I'm looking
[19:18:09] <wikibugs_>	 Operations, Ops-Access-Requests, Research-and-Data, Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969#2860421 (RobH) Open>Resolved No objections were noted, so this has been merged live. It will take up to 30 minutes...
[19:18:50] <Dereckson>	 I'd suggest we remove the namespace
[19:18:52] <ostriches>	 !log gerrit: bringing offline for just a minute or two for bug fix upgrade, T152640
[19:18:52] <Dereckson>	 alias
[19:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:11] <Dereckson>	 and we run namespaceDupes again, it will move the page to main
[19:19:15] <MarcoA>	 Dereckson: which one? there's no alias
[19:19:20] <Dereckson>	 VD:
[19:19:34] <MarcoA>	 you mean, remove it from the interwiki map?
[19:19:48] <Dereckson>	 no no, I thought VD: was also an alias for a ca.wikiquote namespace
[19:19:54] <MarcoA>	 nope
[19:20:12] <MarcoA>	 VD: should just use the main namespace
[19:20:15] <Dereckson>	 ok
[19:20:25] <MarcoA>	 but I think the interwiki map is messing there
[19:20:39] <MarcoA>	 my suggestion is to remove vd: from the interwiki map
[19:21:00] <Dereckson>	 let's try through the api to rename this page
[19:21:13] <icinga-wm>	 RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[19:22:30] <ostriches>	 !log gerrit: back up, including new fixes. Users will have to re-login, sorry :)
[19:22:33] <icinga-wm>	 PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[19:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:54] <paladox>	 ostriches i now have the problems
[19:22:57] <paladox>	 with signing in
[19:22:58] <paladox>	 lol
[19:22:59] <paladox>	 Cannot assign user name "paladox" to account 4335; name already in use.
[19:23:03] <ostriches>	 ....
[19:23:03] <icinga-wm>	 PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki]
[19:23:06] <paladox>	 i used paladox which worked up until now
[19:23:21] <paladox>	 ostriches i can't log in now
[19:23:33] <paladox>	 Using Paladox doesn't work either
[19:23:36] <robh>	 that sound i just heard was chad's sanity snapping.
[19:23:59] <ostriches>	 Interesting, mine's working case-insensitive now too
[19:24:06] <Dereckson>	 MarcoA: bd808:  VD-PMF already exists, that's why it didn't rename it
[19:24:07] <ostriches>	 ....
[19:24:10] <ostriches>	 I can't with you gerrit....
[19:24:17] <paladox>	 Oh
[19:24:25] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860428 (kaldari) @jcrespo: Thanks for reminding me to list long-running script runs on the calendar. I had completely forgotten...
[19:24:43] <robh>	 mine works, case insensitive or with proper case.
[19:24:51] <MarcoA>	 Dereckson: maybe they noticed that VD:xxx didn't work and then used slashes instead of colons
[19:25:10] <paladox>	 I'm wondering if this problem is fixed for users whose actual usernames are uppercase
[19:25:21] <paladox>	 but breaks for users whose actual username is lowercase
[19:25:26] <Dereckson>	         "code": "missingtitle",
[19:25:30] <Dereckson>	         "info": "The page you requested doesn't exist",
[19:25:33] <Dereckson>	 fun
[19:25:38] <Reedy>	 "Please follow along the discussion at #wikimedia-operations on freenode as we debug."
[19:26:08] <Dereckson>	         "code": "invalidtitle",
[19:26:09] <Dereckson>	         "info": "Bad title \"VD:PMF\"",
[19:26:19] <paladox>	 Reedy likely ostriches may be reindexing
[19:26:24] <paladox>	 as suggested by upstream
[19:26:28] <Krenair>	 I think Gerrit broke ostriches 
[19:26:30] <Dereckson>	 So, not through API
[19:26:49] <Dereckson>	 oh yes
[19:26:55] <Dereckson>	 let's do it through API, but using the pageid
[19:26:58] <ostriches>	 paladox: I'm reindexing you as a test
[19:27:02] <paladox>	 ok
[19:27:04] <paladox>	 thanks :)
[19:27:08] <Dereckson>	 MarcoA: you didn't note the page id by the way?
[19:27:23] <icinga-wm>	 PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[19:27:33] <icinga-wm>	 PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused
[19:27:40] <MarcoA>	 Dereckson: id=6879 ns=0 dbk=VD:PMF -> VD-PMF (no conflict)
[19:27:43] <ostriches>	 paladox: Ok, try your account now
[19:27:47] <paladox>	 Ok
[19:27:54] <Dereckson>	 MarcoA: that's the one move, I suspect there is another one
[19:27:54] <paladox>	 That worked
[19:28:03] <icinga-wm>	 PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[19:28:04] <paladox>	 let me try case 
[19:28:05] <ostriches>	 YAY
[19:28:08] <paladox>	 insensitive
[19:28:10] <ostriches>	 Case, less important for now
[19:28:15] <ostriches>	 BUT YAY THE FIX WORKS
[19:28:23] <icinga-wm>	 RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[19:28:25] <paladox>	 ostriches yay
[19:28:27] <paladox>	 it works
[19:28:30] <Reedy>	 ostriches: remote: ERROR:  committer email address reedy@wikimedia.org        
[19:28:34] <icinga-wm>	 RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.3-13-ge39e5211b7 (SSHD-CORE-1.2.0) (protocol 2.0)
[19:28:36] <Reedy>	 it's gone awol?
[19:28:42] <paladox>	 Reedy try logging in on gerrit
[19:28:43] <icinga-wm>	 PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[19:28:44] <ostriches>	 DON'T GIVE ME NEW PROBLEMS
[19:28:47] <ostriches>	 GO AWAY
[19:28:49] <paladox>	 lol
[19:29:24] <paladox>	 Reedy is that the only error it gave
[19:29:42] <Reedy>	 along with many other lines
[19:29:45] <paladox>	 Oh
[19:29:48] <Reedy>	 Mines completely broken
[19:29:49] <ostriches>	 Reedy: No, you just don't have that e-mail registered...
[19:29:50] <Reedy>	 Cannot assign user name "reedy" to account 4340; name already in use.
[19:29:53] <Reedy>	 ostriches: I did previous
[19:29:55] <Reedy>	 ly
[19:30:02] <ostriches>	 I'm saying you don't now :p
[19:30:04] <Reedy>	 I can't login at all to gerrit
[19:30:15] <ostriches>	 Yeah your account is among the busted.
[19:30:33] <icinga-wm>	 PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki]
[19:30:37] <ostriches>	 Ok, now I just gotta find all the busted accounts and fix them
[19:30:40] <ostriches>	 It shouldn't break again
[19:30:46] <grrrit-wm>	 (PS2) Andrew Bogott: Add labs root key for bd808 [labs/private] - https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: BryanDavis)
[19:30:54] <grrrit-wm>	 (CR) Andrew Bogott: [V: 2 C: 2] Add labs root key for bd808 [labs/private] - https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: BryanDavis)
[19:31:18] <paladox>	 ostriches we should create a test user in wikitech with uppercase username and try it in gerrit.
[19:31:25] <ostriches>	 I'm not worried about the casing.
[19:31:28] <ostriches>	 That doesn't matter yet.
[19:31:28] <paladox>	 just to see whether anything will affect new users too.
[19:31:33] <paladox>	 Ok
[19:31:36] <ostriches>	 We need to fix the broken accounts first
[19:31:49] <paladox>	 yep
[19:31:54] <paladox>	 i wonder how do we do that?
[19:32:00] <paladox>	 as we could have a ton
[19:32:06] <ostriches>	 Just gotta find the rows that are busted, shouldn't be hard.
[19:32:10] <paladox>	 oh
[19:32:37] <wikibugs>	 Operations, ops-eqiad, DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860443 (Cmjohnson) The thermal paste on cpu1 was nearly non-existent.  Cleaned both CPU's and re-applied paste.  After booting the server, the disk in slot 2 failed. A ticket has been created w...
[19:32:56] <MarcoA>	 Dereckson: sorry, I don't have the page ID then. I still think VD: conflicts with interwiki map vd: - I'll remove it from there and we can see if that resolves the issue.
[19:33:30] <Dereckson>	 let me try something before
[19:36:09] <MarcoA>	 sorry, I hadn't read your message before - it's removed there, but it won't take effect until dumpInterwiki is run and merged
[19:38:29] <ostriches>	 Ok, it's 37 total busted users.
[19:38:46] <paladox>	 Oh
[19:39:45] <Dereckson>	 MarcoA: bd808: so, there is no VD:PMF, and the only PMF we have is Viquidites:PMF, see https://ca.wikiquote.org/w/index.php?title=Viquidites:PMF&action=info
[19:40:08] <Dereckson>	 The special double redirect page has still old data: The following data is cached, and was last updated 07:35, 7 December 2016. A maximum of 5,000 results are available in the cache.
[19:43:26] <ostriches>	 paladox: https://phabricator.wikimedia.org/P4603
[19:43:41] <paladox>	 Oh :) :)
[19:43:57] <ostriches>	 Basically, insert all the rows like they should look. IGNORE any that are already there
[19:44:17] <paladox>	 Yep
[19:44:24] <paladox>	 thats a great sql query :)
[19:46:15] <ostriches>	 Lemme test it on my test data
[19:46:35] <paladox>	 Ok :)
[19:47:27] <ostriches>	 Ok....now, here's the final test.
[19:47:34] <paladox>	 Ok :)
[19:47:43] <icinga-wm>	 PROBLEM - carbon-cache@d service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed
[19:47:49] <ostriches>	 Works on the test data.
[19:47:55] <paladox>	 :) :) :)
[19:47:56] <grrrit-wm>	 (PS3) Andrew Bogott: openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk)
[19:48:08] <ostriches>	 !gerrit: Down one last time to fix the busted accounts
[19:48:14] <ostriches>	 !log gerrit: Down one last time to fix the busted accounts
[19:48:23] <icinga-wm>	 PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:53] <ostriches>	 Ok, crossing fingers....
[19:50:02] <apergos>	 metoo
[19:50:11] * greg-g throws salt over his left shoulder
[19:50:20] <ostriches>	 Ok, I logged in ok....
[19:50:26] <ostriches>	 James_F: Please tell me you can login again
[19:50:27] * ostriches prays
[19:50:33] <James_F>	 Err.
[19:50:59] <James_F>	 Yes.
[19:51:01] <paladox>	 :)
[19:51:03] <James_F>	 Yay! Thank you ostriches.
[19:51:20] <ostriches>	 YAYAYAYAYAYAYAY
[19:51:33] <icinga-wm>	 PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org]
[19:51:33] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui]
[19:52:03] <paladox>	 Reedy could you try logging in please? You should be able to use reedy and Reedy now again :)
[19:52:44] <ostriches>	 Also reedy@wm.o should work again ;-)
[19:52:52] <grrrit-wm>	 (PS1) Filippo Giunchedi: debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640)
[19:53:02] <paladox>	 Oh :)
[19:53:16] <wikibugs>	 Operations, Labs, Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#2860488 (ArielGlenn) We are all agreed that jenkins failing on this would be pretty annoying, since it could block unrelated changes.  BUT surely we can do some sort of regular audit.
[19:53:49] <paladox>	 ostriches project access links are now fixed
[19:54:00] <paladox>	 as you pulled from upstream 2.13 branch it included the fix :)
[19:54:23] <icinga-wm>	 RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[19:54:43] <icinga-wm>	 RECOVERY - carbon-cache@d service on graphite1003 is OK: OK - carbon-cache@d is active
[19:55:43] <icinga-wm>	 RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[19:55:51] <ostriches>	 kaldari: Are you back in now?
[19:55:54] <ostriches>	 Plz say yes <3
[19:56:03] <icinga-wm>	 RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[19:56:04] <kaldari>	 kaldari: yep
[19:56:12] <kaldari>	 ostriches: thanks!
[19:56:25] <ostriches>	 YAYAYAYAY <3
[19:56:54] <ostriches>	 Well, that was a fun way to spend my Wed/Thurs/Fri!
[19:56:58] <paladox>	 LOL
[19:57:00] <ostriches>	 Let's do that again real soon.
[19:57:01] <ostriches>	 Not.
[19:57:04] <paladox>	 LOL
[19:57:19] <apergos>	 look at it this way, you'll have the weekend free :-P :-P
[19:57:24] <paladox>	 ostriches onto the next task https://phabricator.wikimedia.org/T152663
[19:57:25] <paladox>	 LOL
[19:57:33] <icinga-wm>	 RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[19:57:40] <ostriches>	 paladox: Don't remind me!
[19:57:47] <ostriches>	 After lunch, I'm taking a much needed break afk.
[19:57:47] <paladox>	 oh
[19:57:51] <paladox>	 ok
[19:57:51] <ostriches>	 hehe
[20:02:17] <paladox>	 https://groups.google.com/forum/#!topic/repo-discuss/KzLJiNqu2AM
[20:02:19] <wikibugs_>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2860511 (Andrew) Open>Resolved
[20:02:22] <paladox>	 ostriches i created ^^
[20:02:38] <paladox>	 just in case we need to do anything that is not documented in the docs
[20:03:01] <ostriches>	 I'm sure we can figure it out....
[20:03:08] <ostriches>	 I don't wanna keep annoying upstream :p
[20:03:25] <paladox>	 Ok
[20:04:06] <grrrit-wm>	 (CR) Paladox: [C: 1] debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) (owner: Filippo Giunchedi)
[20:11:26] <grrrit-wm>	 (Draft1) Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - https://gerrit.wikimedia.org/r/326163
[20:11:29] <grrrit-wm>	 (Draft2) Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - https://gerrit.wikimedia.org/r/326163
[20:13:26] <grrrit-wm>	 (PS1) Papaul: DNS: Add mgmt DNS for ms-fe200[5-8] Bug:T152612 [dns] - https://gerrit.wikimedia.org/r/326165 (https://phabricator.wikimedia.org/T152612)
[20:13:48] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] DNS: Add mgmt DNS for ms-fe200[5-8] Bug:T152612 [dns] - https://gerrit.wikimedia.org/r/326165 (https://phabricator.wikimedia.org/T152612) (owner: Papaul)
[20:19:20] <paladox>	 ostriches i accidentally published a comment on that page about submodules using my real name, luckily they have a delete button, so i deleted the comment and published it under my second google account.
[20:19:33] <icinga-wm>	 RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[20:19:33] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:19:54] <grrrit-wm>	 (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk)
[20:20:03] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[20:21:23] <icinga-wm>	 RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[20:23:46] <kaldari>	 !log mwscript updateCollation.php --wiki=bnwiki --force
[20:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:33] <paladox>	 I had to report https://bugs.chromium.org/p/gerrit/issues/detail?id=5116
[20:28:38] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk)
[20:32:00] <ostriches>	 paladox: Yeah I saw that
[20:32:07] <paladox>	 Oh :)
[20:32:20] <ostriches>	 I also reported https://bugs.chromium.org/p/gerrit/issues/detail?id=5111 earlier
[20:32:49] <paladox>	 oh
[20:39:05] <icinga-wm>	 PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:53:19] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid'])
[20:53:20] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid'])
[20:53:21] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid'])
[20:53:23] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[20:53:23] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium'])
[20:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:33] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver'])
[20:53:34] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores'])
[20:53:35] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender'])
[20:53:36] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams'])
[20:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:59] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid'])
[20:54:01] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid'])
[20:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:04] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid'])
[20:54:05] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[20:54:11] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium'])
[20:54:14] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver'])
[20:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:17] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores'])
[20:54:18] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender'])
[20:54:23] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams'])
[20:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:36] <wikibugs>	 Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2860598 (akosiaris) I 've fully repooled the servers, let's wait a couple of days and see.
[20:57:21] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[20:57:22] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps'])
[20:57:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:43] <akosiaris>	 !log fully repool scb1003, scb1004, T150882
[20:57:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:54] <stashbot>	 T150882: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882
[20:58:24] <grrrit-wm>	 (PS1) Andrew Bogott: Remove http monitoring for spice-proxy [puppet] - https://gerrit.wikimedia.org/r/326173
[20:59:58] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Remove http monitoring for spice-proxy [puppet] - https://gerrit.wikimedia.org/r/326173 (owner: Andrew Bogott)
[21:03:02] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:08:07] <icinga-wm>	 RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[21:24:46] <wikibugs>	 Operations, MediaWiki-API, Monitoring, Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859704 (Legoktm) I don't think this is going to be that useful....the numbers will be all over the place, we really just need to look at th...
[21:25:17] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[21:30:59] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[21:31:39] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:31:45] <chasemp>	 !log hotpatched python maintain-meta_p.py --all-databases --debug on labsdb1001/1003 
[21:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:11] <grrrit-wm>	 (Draft1) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177
[21:39:14] <grrrit-wm>	 (Draft2) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177
[21:40:01] <grrrit-wm>	 (PS3) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177
[21:44:01] <grrrit-wm>	 (PS4) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177
[21:44:29] <MatmaRex>	 robh: hey, poking you since you're on duty :) but not high priority: do you know when we're planning to upgrade to HHVM 3.12.11? it would fix metadata extraction for certain files. https://phabricator.wikimedia.org/T148606
[21:45:53] <robh>	 i have no idea, but i imagine _joe_ would (it's late in his timezone in the eu though so i'll ask him on monday)
[21:46:29] <robh>	 moritz is out on leave so it may not be as fast as normal
[21:46:32] <robh>	 but i'll ask
[21:52:19] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[21:53:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89236.08 seconds
[21:59:39] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:00:17] <grrrit-wm>	 (PS4) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949
[22:01:08] <MatmaRex>	 ok, thanks
[22:33:18] <grrrit-wm>	 (CR) Aaron Schulz: [C: 1] Don't retry InitImageDataJob's [puppet] - https://gerrit.wikimedia.org/r/326151 (owner: EBernhardson)
[22:34:56] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2] debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) (owner: Filippo Giunchedi)
[22:40:29] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[22:44:19] <icinga-wm>	 PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:47:19] <icinga-wm>	 RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[22:53:19] <wikibugs_>	 Operations, ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2860862 (fgiunchedi) Resolved>Open @Cmjohnson looks like we're seeing this again on ms-be1016 :(  ``` root@ms-be1016:~# hpssacli controller all show  Smart Array P840 in Slot 1                (sn...
[22:58:19] <logmsgbot>	 !log catrope@tin Synchronized php-1.29.0-wmf.5/resources/src/mediawiki.language/mediawiki.language.numbers.js: T152800 (duration: 00m 45s)
[22:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:34] <stashbot>	 T152800: Notices and Alerts on Wikipedia Arabic cannot be opened (TypeError: transformTable is undefined) - https://phabricator.wikimedia.org/T152800
[23:07:27] <wikibugs>	 Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860886 (jcrespo) @kaldari, this doesn't have to be synchronous. Please schedule a time with some advance notice on the Deployme...
[23:07:29] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[23:11:43] <wikibugs_>	 Operations, ops-eqiad, DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860890 (jcrespo) Thank you very much. I will reset the RAID when the new disk gets installed (if I can handle the bios interface). A new disk failing would explain the previous RAID I/O error.
[23:12:15] <grrrit-wm>	 (CR) BryanDavis: [C: -1] "Why would we send gerrit logs into the deployment-prep logstash instance? There is no reason for those logs to leave the production realm." [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox)
[23:13:09] <icinga-wm>	 PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:13:55] <grrrit-wm>	 (CR) Paladox: "Hi, we wouldn't. I am testing it on a labs instance, so this patch will change to the prod logstash once I get it working on labs." [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox)
[23:16:05] <grrrit-wm>	 (CR) BryanDavis: [C: -1] "You should setup your own logstash cluster in the project where you are testing this. The deployment-prep ELK stack isn't a general servic" [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox)
[23:16:34] <grrrit-wm>	 (CR) Paladox: "Ok." [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox)
[23:30:09] <grrrit-wm>	 (PS3) Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426)
[23:30:57] <wikibugs>	 Operations, Mail: Create email alias for benefactors@ - https://phabricator.wikimedia.org/T152641#2855852 (CaitVirtue) Hi All -- Can you share an ETA on this?  We're triaging a high volume of email to benefactors@ right now via gmail, which is really cumbersome. Getting these messages into ZenDesk is goi...
[23:31:04] <grrrit-wm>	 (CR) Filippo Giunchedi: "@volans thanks for the review! I've addressed your comments." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: Filippo Giunchedi)
[23:31:21] <wikibugs_>	 Operations, Mail: Create email alias for benefactors@ - https://phabricator.wikimedia.org/T152641#2860929 (CaitVirtue) p:Triage>Unbreak!
[23:40:38] <grrrit-wm>	 (PS1) Chad: No need to import ValueError, it's built in [mediawiki-config] - https://gerrit.wikimedia.org/r/326199
[23:42:09] <icinga-wm>	 RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[23:42:17] <wikibugs_>	 Operations, ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2860965 (fgiunchedi) p:Triage>High
[23:46:55] <paladox>	 ostriches i figured out how to get these submodules to work
[23:46:56] <paladox>	 yay
[23:46:57] <paladox>	 yay
[23:47:25] <paladox>	 just one more test to confirm
[23:48:40] <paladox>	 yeh
[23:48:43] <paladox>	 i figured it out
[23:48:56] <paladox>	 submitting a patch now to give it a test on prod
[23:49:27] <paladox>	 it will require us to set it on all mediawiki/extensions/*, otherwise the ones that have it set will work
[23:49:31] <paladox>	 and the others that don't won't
[23:51:29] <icinga-wm>	 PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:52:10] <ostriches>	 Can we remove it from all of them so it inherits?
[23:52:48] <paladox>	 Yeh i am trying that one
[23:54:57] <paladox>	 ostriches i can't believe upstream did not mention you have to add subscribe.
[23:54:59] <paladox>	 yay
[23:55:08] <paladox>	 it works without having to set each and every config
[23:55:11] <paladox>	 submitting now
[23:55:49] <ostriches>	 I guess we gotta update all repos...
[23:55:57] <ostriches>	 Will have to do it in batch...
[23:56:34] <paladox>	 no
[23:56:37] <paladox>	 we don't
[23:56:46] <paladox>	 ostriches i managed to set it once in All-Projects
[23:56:51] <paladox>	 and it worked
[23:59:31] <paladox>	 ostriches https://gerrit.wikimedia.org/r/#/c/326200/
[23:59:32] <paladox>	 :)