[00:00:07] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150130T0000). Please do the needful. [00:04:11] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures [00:09:02] (03CR) 10Santhosh: [C: 031] cxserver: enable no->nn language pair [puppet] - 10https://gerrit.wikimedia.org/r/186522 (https://phabricator.wikimedia.org/T76674) (owner: 10KartikMistry) [00:10:04] (03PS1) 10Matthias Mullie: Add $wgParsoid... variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187625 [00:16:59] (03PS1) 10Dzahn: decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) [00:22:11] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [00:22:23] 3Wikimedia-Git-or-Gerrit, operations: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1002856 (10Dzahn) [01:08:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [01:30:34] !log powercycling cp1047 [01:30:42] Logged the message, Master [01:33:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:35:09] 3ops-eqiad, operations: cp1047 down - https://phabricator.wikimedia.org/T88045#1003104 (10Dzahn) p:5Triage>3Normal [01:36:41] !log cp1047 - DIMM fail -> T88045 [01:36:58] Logged the message, Master [02:14:28] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:14:41] Logged the message, Master [02:15:35] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-30 02:14:31+00:00 [02:15:44] Logged the message, Master [02:19:41] ACKNOWLEDGEMENT - Host cp1063 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T84809 [02:20:52] ACKNOWLEDGEMENT - Host cp1047 is DOWN: 
PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T88045 [02:23:53] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 03s) [02:24:07] Logged the message, Master [02:25:00] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-30 02:23:56+00:00 [02:25:15] Logged the message, Master [02:38:21] (03PS1) 10Dzahn: torrus: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/187643 (https://phabricator.wikimedia.org/T87817) [02:38:30] 3Multimedia: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1003414 (10Tgr) a:5Tgr>3None [02:39:18] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003431 (10Dzahn) a:3Dzahn [02:39:37] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#999760 (10Dzahn) p:5Triage>3Low [02:40:51] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#999760 (10Dzahn) this would translate to: ``` @neon:~# /usr/lib/nagios/plugins/check_http -H torrus.wikimedia.org -I 208.80.154.159 -u '/torrus' -s 'Torrus Top: Wikimedia' HTTP OK: HTTP/1.1 200 OK - 2166 bytes in 0.082 second respo... 
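The truncated `check_http` paste above is the whole of the monitoring logic; in Puppet terms it reduces to a single resource. A minimal sketch of what the "torrus: add http monitoring" change plausibly contains (the resource title and check-command plugin name here are assumptions, not the actual patch):

```
# Hypothetical Icinga check for torrus, equivalent to:
#   check_http -H torrus.wikimedia.org -u '/torrus' -s 'Torrus Top: Wikimedia'
monitoring::service { 'torrus-http':
    description   => 'torrus.wikimedia.org HTTP',
    check_command => 'check_http_url_for_string!torrus.wikimedia.org!/torrus!Torrus Top: Wikimedia',
}
```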
[02:41:43] (03CR) 10Dzahn: [C: 032] torrus: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/187643 (https://phabricator.wikimedia.org/T87817) (owner: 10Dzahn) [02:49:17] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#1003438 (10Dzahn) a:3ArielGlenn [02:55:54] (03PS11) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:56:52] (03CR) 10jenkins-bot: [V: 04-1] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:58:23] (03PS12) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [02:59:17] (03CR) 10jenkins-bot: [V: 04-1] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:01:55] (03PS13) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:05:45] (03PS14) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:07:44] (03PS15) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:08:23] (03CR) 10Dzahn: [C: 031] "PS1 was uploaded on 2014-08-07 ! 
_please_ unblock and no more rebasing :p" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [03:10:02] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#1003461 (10Dzahn) fixed https://gerrit.wikimedia.org/r/#/c/152724/ needed a bunch of rebasing. PS1 was uploaded 2014-08-07 please unblock this now [03:13:06] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003468 (10Dzahn) @jgage https://gerrit.wikimedia.org/r/#/c/187643/ was supposed to resolve it, but it didn't show up in Icinga yet even though I ran puppet on neon.. will have to debug why [03:19:09] 3operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#1003475 (10Dzahn) [03:19:10] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003473 (10Dzahn) 5Open>3Resolved ah, here it is, just needed more patience: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon1001&service=torrus.wikimedia.org+HTTP https://icinga.wikimedia.org/cgi-bin/...
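Dzahn's debugging loop here (merge the patch, run puppet on neon, wait for the check to surface) can be sketched as a few commands on the monitoring host; a hedged outline assuming a stock Icinga 1.x layout, not the exact procedure used:

```
# On the Icinga host (neon): apply the catalog so the new service is
# generated, verify the resulting config parses, then reload Icinga.
puppet agent --test
icinga -v /etc/icinga/icinga.cfg
service icinga reload
```

As the log shows, no manual step was actually needed; the check "just needed more patience" for the regular puppet cycle.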
[03:19:28] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#1003476 (10Dzahn) [03:25:09] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [03:31:21] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [03:34:09] PROBLEM - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /var 3002 MB (3% inode=99%): [03:43:50] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [03:49:10] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [04:00:56] (03PS1) 10Mattrobenolt: Fix bad symlinks for kafka-common [debs/kafka] - 10https://gerrit.wikimedia.org/r/187648 [04:21:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 30 04:20:12 UTC 2015 (duration 20m 11s) [04:21:20] Logged the message, Master [04:38:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:51:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [06:19:00] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:01] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:10] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:20] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: puppet fail [06:28:20] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [06:28:49] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:40] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:59] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:00] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:30:00] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:29] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:30] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:10] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:40] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:29] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:40] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is 
currently enabled, last run 44 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:40] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures 
[06:47:00] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:50] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:29] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:01:55] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1003802 (10Sunpriat) 3NEW [08:22:29] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1003895 (10Chmarkine) The certificate for svn.wikimedia.org will expire in about one day. There are some links to https://svn.wikimedia.org on enwiki that will be affected by the expiration. https://en.wikipedia.org/w/... [08:49:44] greetings [09:27:06] 3operations: Monitor Netapps - https://phabricator.wikimedia.org/T87836#1003943 (10faidon) [09:27:07] 3operations: Create Icinga alerts for Netapp health - https://phabricator.wikimedia.org/T87839#1003940 (10faidon) 5Open>3declined a:3faidon See parent task for more. 
[09:27:56] 3operations: Monitor Netapps - https://phabricator.wikimedia.org/T87836#1000337 (10faidon) [09:27:58] 3operations: Retire Torrus - https://phabricator.wikimedia.org/T87840#1003951 (10faidon) [09:27:59] 3operations: Graph Netapp SNMP stats with LIbreNMS - https://phabricator.wikimedia.org/T87837#1003947 (10faidon) 5Open>3declined a:3faidon See parent task for more. Moreover, assuming LibreNMS would be a good candidate was the wrong call; the parent task alone would be enough to describe the whole of this... [09:42:40] 3operations, ops-ulsfo: fan reversed on asw1-ulsfo - https://phabricator.wikimedia.org/T83978#1003963 (10faidon) @Gage, I know you went on-site on Friday; can you update this with the status? [09:55:04] (03PS1) 10Filippo Giunchedi: introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) [09:55:06] (03PS1) 10Filippo Giunchedi: provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) [10:02:50] RECOVERY - Disk space on labstore1001 is OK: DISK OK [10:04:31] !log labstore1001: setting /proc/sys/sunrpc/{nfs,rpc}_debug to 0; rm /var/log/{kern.log,syslog.1,syslog} [10:04:41] Logged the message, Master [10:26:40] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [10:28:06] Jumping on labstore1001 now [10:28:09] hi! [10:28:21] thanks :) [10:29:43] Sorry about having been mostly away - I've slept about 40 out of the past 48 hours while the meds finally got rid of the crap in my lungs. 
:-( [10:30:07] oops [10:30:20] <_joe_> Coren: now it's just extremely slow [10:30:24] Yeah, well, the good part is I can actually breathe [10:30:31] <_joe_> from beta [10:30:43] <_joe_> Coren: :/ [10:31:03] (03PS2) 10Filippo Giunchedi: provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) [10:31:04] _joe_: Yeah, I'm seeing it - it's slow from everywhere. Something is hammering on it like crazy and I'm tracking down what [10:31:05] (03PS2) 10Filippo Giunchedi: introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) [10:32:59] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:33:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] introduce graphite raid10-lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/187663 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [10:34:00] (03PS16) 10Giuseppe Lavagetto: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [10:41:48] paravoid: I might need you to take a peek switch-side - something odd is going on that might be network-related: NFS traffic stalls entirely for 4-5 secs, then 4-5 secs for near-saturation burst, then stalls again. [10:42:09] paravoid: And the traffic seems to arrive from only one half of the bonded ports [10:42:49] did we ever fix bonding? I think we didn't? [10:45:19] paravoid: Well, labstore1001 seems to think it's working; it sees the bonding with two slaves [10:46:58] I distinctly remember having issues with bonding in the past and giving up since there wasn't a need for it [10:47:11] both me and mark had tried [10:47:21] the switch has a single-port bond configured [10:47:29] so that should be fine [10:47:55] Ah, that'd explain why everything is coming over one side only. 
:-) [10:48:06] yeah we've discussed this before [10:48:16] <_joe_> also, it's probably something misconfigured since the reboot [10:48:31] <_joe_> and now I get why there was that fuckton of messages [10:48:38] we didn't reboot [10:48:42] lemme have a look [10:49:05] diamond [10:49:06] 100% cpu [10:49:10] [pid 3686] stat("/exp/dumps", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [10:49:13] [pid 3686] stat("/exp/backups", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [10:49:16] in a loop [10:49:29] ah now stopped [10:49:58] md122? md127? [10:50:48] disks are saturated [10:50:55] esp. sdbg [10:51:05] Yeah, I'm trying to find the root cause. [10:51:11] dm0 is saturated completely [10:51:14] so, not a network issue [10:51:46] paravoid: No, chances are the bursty traffic is a symptom of the disk being bursty [10:51:55] paravoid: And not the cause as I first suspected [10:52:04] yup [10:54:28] <_joe_> I'm off for now, gonna grab some food + take a nap [10:54:56] The md*_raid6 processes are hard at it - I'd have thought it was doing raid validation but mdstat says not. Huh. [11:00:08] ttyl [11:00:51] <_joe_> paravoid: send me a message when you're at the hotel [11:01:10] <_joe_> not sure I'll be awake :) [11:01:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] provision graphite[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/187664 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [11:02:53] paravoid: Hm. Unless you changed something, whatever was hammering on the disks stopped hammering on the disks. [11:10:10] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004037 (10Kelson) 3NEW [11:23:16] PROBLEM - uWSGI web apps on graphite1001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. 
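The bonding state Coren and paravoid discuss above ("labstore1001 ... sees the bonding with two slaves") is read from `/proc/net/bonding/bond0`. A self-contained sketch using invented sample text (not labstore1001's real output):

```shell
# Sample of what /proc/net/bonding/bond0 looks like when the host side
# believes two slaves are attached (invented text for illustration).
bond_status='Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: up'

# Count the slaves the kernel thinks are in the bond. Reading 2 here
# while the switch only has a single-port bond configured is exactly
# the mismatch that makes traffic arrive on one half of the ports.
slaves=$(printf '%s\n' "$bond_status" | grep -c '^Slave Interface:')
echo "$slaves"
```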
[11:24:06] PROBLEM - gdash.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time [11:24:16] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.010 second response time [11:24:25] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:26:40] paravoid: I had two outlier instances, ima keep an eye on both of them see if they go cray cray [11:27:25] I *was* about to say that things seemed back to normal, but there it goes again. [11:34:04] 3ops-codfw, operations: graphite2001 stuck at boot with "scanning for devices" - https://phabricator.wikimedia.org/T88101#1004046 (10fgiunchedi) 3NEW [11:46:26] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:02:56] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.166 second response time [12:28:45] 3Labs: Puppet logs should be timestamped in a human-readable way - https://phabricator.wikimedia.org/T88108#1004161 (10scfc) 3NEW [14:50:09] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004254 (10Dzahn) [14:50:30] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1004256 (10Dzahn) >>! In T73156#1003895, @Chmarkine wrote: > The certificate for svn.wikimedia.org will expire in about one day. 
This is T86655 (and historically T24596) [14:52:18] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004260 (10Dzahn) a:3Chad [14:53:15] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#973496 (10Dzahn) [14:54:28] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1004263 (10Dzahn) [14:54:35] ori: I'm trying to run gdash on ruby 1.9 on graphite1001 though it doesn't seem to honor $: << and doesn't find gdash, did you see this before? [14:57:34] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004265 (10mark) p:5High>3Normal We're making this a High priority just because the HTTPS cert expires? I don't know, do we care that much for a service essentially fallen in disuse? Su... [15:02:39] 3operations: Sometimes error sec_error_ocsp_old_response - https://phabricator.wikimedia.org/T88087#1004267 (10BBlack) [15:12:02] 3operations: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1004272 (10chasemp) FWIW I cherry-picked the relevant upstream commit on phab-01 for my man @Chad to test if the SVN issue is actually resolved. This week being crazy I think that still ne... [15:55:11] (03CR) 10BBlack: [C: 031] decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [15:56:00] (03CR) 10BBlack: [C: 031] decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [15:56:48] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004335 (10BBlack) >>! 
In T87800#1002742, @Dzahn wrote: >> "no longer have a puppet cache role" > > this looks like they still do: > > https://gerrit.wikimedia.org/r/#/c/187615/1/manifests/site.pp Sorry, I should have sa... [15:57:44] (03PS1) 10Filippo Giunchedi: graphite: explicit install python-twisted-core [puppet] - 10https://gerrit.wikimedia.org/r/187683 (https://phabricator.wikimedia.org/T85909) [16:01:32] (03PS1) 10BBlack: disable compact_memory on jessie T83809 [puppet] - 10https://gerrit.wikimedia.org/r/187684 [16:02:22] (03CR) 10BBlack: [C: 032] disable compact_memory on jessie T83809 [puppet] - 10https://gerrit.wikimedia.org/r/187684 (owner: 10BBlack) [16:13:05] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [16:13:40] ^ generic puppetmaster http error [16:15:40] jgage: torrus now monitored https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=torrus [16:16:17] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow using a different web user than apache [puppet] - 10https://gerrit.wikimedia.org/r/187259 [16:16:19] (03PS1) 10Giuseppe Lavagetto: labstore: do not explicitly declare the apache user existence [puppet] - 10https://gerrit.wikimedia.org/r/187686 [16:16:21] (03PS1) 10Giuseppe Lavagetto: maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 [16:16:23] (03PS1) 10Giuseppe Lavagetto: beta: allow defining the web user. 
[puppet] - 10https://gerrit.wikimedia.org/r/187688 [16:18:28] !log restarting frontend varnishes to apply increased cache sizes from https://gerrit.wikimedia.org/r/#/c/186816/ over the next ~9H [16:18:36] Logged the message, Master [16:26:48] (03PS1) 10Filippo Giunchedi: graphite: format /var/lib/carbon [puppet] - 10https://gerrit.wikimedia.org/r/187690 (https://phabricator.wikimedia.org/T85909) [16:27:38] (03PS16) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [16:28:11] (03PS1) 10Chad: Gerrit: limit max object size to 50M [puppet] - 10https://gerrit.wikimedia.org/r/187691 [16:29:19] 3operations, ops-ulsfo: fan reversed on asw1-ulsfo - https://phabricator.wikimedia.org/T83978#1004360 (10Gage) Thanks for the reminder. Onsite visit was Weds, 2015-01-28. I updated the procurement ticket in https://rt.wikimedia.org/Ticket/Display.html?id=8596 but forgot to also update Phab: Wrong part was recei... [16:33:16] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:34:10] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1004362 (10GWicke) @akosiaris, can we allow connections from tin, so that we can deploy from the deploy host? 
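The "Gerrit: limit max object size to 50M" change under review in this stretch of the log maps onto one real Gerrit setting, `receive.maxObjectSizeLimit`; the fragment below is a sketch of the effective `gerrit.config` stanza, not the actual patch:

```
[receive]
  maxObjectSizeLimit = 50m
```

This caps individual git objects at push time (which also applies to new repo imports); it says nothing about total repository size.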
[16:35:05] PROBLEM - DPKG on cp4002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:06] PROBLEM - DPKG on cp4001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:06] PROBLEM - DPKG on cp4003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:35:12] <_joe_> mh [16:35:15] PROBLEM - DPKG on cp4004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:36:06] RECOVERY - DPKG on cp4001 is OK: All packages OK [16:36:06] RECOVERY - DPKG on cp4002 is OK: All packages OK [16:36:06] (03CR) 10Rush: [C: 031] "seems good to me assuming this allows direct push for new repo importing etc :)" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:36:16] RECOVERY - DPKG on cp4003 is OK: All packages OK [16:36:16] RECOVERY - DPKG on cp4004 is OK: All packages OK [16:37:47] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1004363 (10akosiaris) Sure we can but isn't parsoid deployed via trebuchet which does not use SSH at all ? [16:38:45] PROBLEM - DPKG on cp3020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:39:01] (03CR) 10Chad: "It'll also affect new repo imports but why would any repo have a 50M object? For objects that large you should be using git-fat or somethi" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:39:40] (03CR) 10Rush: "yes, I was thinking total repo size not object size :) You are most correct I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [16:48:07] !log expect more icinga "CRITICAL: DPKG CRITICAL" on cache nodes for a while; applying backlog of upstream pkg updates slowly to all [16:48:12] Logged the message, Master [17:01:41] (03CR) 10Dzahn: [C: 031] "reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [17:06:12] <^d> mutante: We could go ahead and merge that I guess [17:06:22] <^d> It'll restart gerrit but there's never a good time to do that [17:06:33] <^d> Now's as bad as any :p [17:11:03] hey, it's a friday, who cares?! [17:11:12] sarcasm hiding truth [17:12:18] we need to do /something/ on fridays [17:12:32] or I can go on happy friday afternoon now if you prefer ;-p [17:12:38] drinks! [17:13:27] <^d> Gerrit's slow enough right now, good/bad time as any [17:13:33] <^d> If anyone's feeling bold enough to press +2 [17:13:53] only one thing in the zuul pipeline: https://integration.wikimedia.org/zuul/ [17:14:06] (03CR) 10Mark Bergsma: [C: 032] Gerrit: limit max object size to 50M [puppet] - 10https://gerrit.wikimedia.org/r/187691 (owner: 10Chad) [17:14:10] :) [17:14:46] <^d> The puppet run on ytterbium should restart gerrit for us [17:15:25] do you want me to run it manually? [17:15:37] <^d> I'm already logged in, I can do it [17:15:40] k [17:16:07] <^d> !log running puppet on ytterbium, gerrit shall restart [17:16:14] Logged the message, Master [17:16:56] <^d> Ok, and we're back. [17:17:01] <^d> Thanks for the merge mark [17:19:14] w00t, didn't lose any build status reports [17:23:35] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004398 (10Arlolra) a:3Arlolra [17:32:39] 3Parsoid, operations, Parsoid-Team: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1004415 (10Arlolra) @Kelson Yes, this is related to https usage. 
Same for uzwiki. Fixing ... [17:38:41] !log andyrussg Synchronized php-1.25wmf15/extensions/CentralNotice: Update CentralNotice (duration: 00m 06s) [17:38:45] Logged the message, Master [17:40:53] (03CR) 10Jforrester: [C: 04-1] Add $wgParsoid... variables (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187625 (owner: 10Matthias Mullie) [17:44:17] (03PS1) 10BBlack: add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 [17:49:44] 3ops-eqiad, operations: Rack and setup graphite1001 - https://phabricator.wikimedia.org/T86939#1004486 (10Cmjohnson) a:5Cmjohnson>3fgiunchedi [17:56:08] 3ops-core: reclaim dysprosium for spare (was: server status) - https://phabricator.wikimedia.org/T83070#1004512 (10BBlack) [17:58:41] (03PS2) 10BBlack: add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 [18:00:17] (03PS3) 10Dzahn: decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) [18:01:59] (03CR) 10Dzahn: [C: 032] decom cp1037,cp1038,cp1039,cp1040 [puppet] - 10https://gerrit.wikimedia.org/r/187615 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [18:05:35] PROBLEM - salt-minion processes on cp1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:06:35] PROBLEM - salt-minion processes on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:11:21] 3ops-codfw, operations: graphite2001 stuck at boot with "scanning for devices" - https://phabricator.wikimedia.org/T88101#1004534 (10fgiunchedi) 5Open>3Invalid a:3fgiunchedi this is actually expected, resolving in favor of T84794 (deployment ticket) [18:15:41] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: Connection refused [18:15:55] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1004553 (10fgiunchedi) currently 
running rsync to transfer metrics changed in the last month to graphite1001, there's ~380k metrics changed in the last 30d and a parallel rsync is churning at ~3/s so ETA for the initial sync... [18:16:36] !log cp1037,cp1038,cp1039,cp1040 - disabled puppet, removed from icinga, revoked certs and salt key etc. decom [18:16:43] Logged the message, Master [18:20:48] Reedy: hi! got a sec? I'm stuck trying to revert my first deploy!!!! [18:21:03] anyone....? ^ [18:21:34] MaxSem: yt? ^ [18:21:42] paging greg-g^ [18:21:44] yup [18:21:49] hmm? [18:21:53] where are you AndyRussG ? [18:21:56] oh [18:22:06] MaxSem: back home in Montreal, and figuratively on tin [18:22:11] kekeke [18:22:27] should have tried this before leaving [18:22:38] It's my first deploy... Yes, true, had to run right after the dev summit tho [18:22:44] AndyRussG: you can click the "revert" button on the relevant gerrit change, then deploy the result of that [18:23:11] it should just make a new change for you [18:23:19] The revert is in the right wmf branch in core, so it seems: [18:23:28] https://gerrit.wikimedia.org/r/#/c/187709/ [18:23:46] okay, what's the problem? [18:23:53] I somehow don't see it on tin [18:23:58] we can do a skype/hangout, btw [18:24:24] Yes you bet, thanks!! [18:24:28] is it under security patches? [18:24:42] because it's a subproject somehow [18:24:49] Hi [18:24:54] Did you just change something, people? [18:25:05] I'm getting a page with no CSS [18:25:13] Or at the very least a broken layout [18:25:23] If you made a change recently I suggest undoing it [18:25:41] yup, the change is there, buried under security patches [18:25:51] (03PS1) 10QChris: Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 [18:25:59] Well it broke [18:26:01] MaxSem: ah maybe that's it then... 
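MaxSem's walkthrough above (click "revert" in gerrit, then deploy the result) comes down to a few steps on tin; a hedged sketch using the branch and path visible in this log, with the caveat that the exact staging layout and commands are assumptions:

```
# On tin: confirm the merged revert is present in the deployed branch's
# checkout, then sync the extension directory to the cluster.
cd /srv/mediawiki-staging/php-1.25wmf15/extensions/CentralNotice
git fetch
git log --oneline ..origin/wmf/1.25wmf15   # the revert should appear here
sync-dir php-1.25wmf15/extensions/CentralNotice 'Revert CentralNotice update'
```

The "buried under security patches" wrinkle is why the revert was hard to spot on tin: locally applied patches sit on top of the fetched branch in the checkout's history.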
[18:26:13] BTW this is not on production generally, just test, mediawiki.org and 2 others [18:26:18] Https version of the site isn't supplying the style information [18:26:28] http version appears to be [18:26:28] works for me [18:26:44] Propagation issue? [18:26:54] Nothing significant is broken, other than a centralnotice campaign running that should show up on mediawiki that doesn't, that's why I'm reverting [18:26:56] I've tried refreshing the cached version [18:27:16] MaxSem: Should I call u via Hangout? [18:27:19] AndyRussG: I am seeing no styling at all [18:27:21] !log rebooting cp3020, something's all wrong there... [18:27:28] Logged the message, Master [18:27:29] AndyRussG, I see no problem [18:27:36] just push your revert now [18:27:39] Qcoder00: what? [18:28:04] The view I have Wikipedia on https right now has no tabs, and no styles [18:28:05] bblack: could what you just did affect users not seeing CSS? [18:28:19] Qcoder00, are you in Europe? [18:28:24] Yes [18:28:26] UK [18:28:29] hehe [18:28:32] PROBLEM - puppet last run on cp3020 is CRITICAL: Connection refused by host [18:28:40] then it's a cp3020 issue likely [18:28:48] Qcoder00: that wasn't me, I didn't push anything to Wikipedia [18:28:48] and so far this only seems to be https related [18:29:04] It's annoying because it breaks Proofread page stuff at Wikisource [18:29:18] MaxSem: Ah I see, right [18:29:32] greg-g: the error itself I'm responding to could have caused css problems for a fraction of users hitting esams, yes [18:29:37] (03PS2) 10QChris: Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 [18:29:51] MaxSem: So I'll push out the security patches too, I guess? Just push out the latest everything in the wmf15 branch? [18:30:00] (to the places it should go) [18:30:04] cp3020 is now disabled in pybal though, so the issue should be no more [18:30:18] cp3020 being a server?
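[editor's note: the Gerrit "revert" button workflow MaxSem walks AndyRussG through above is, under the hood, an ordinary `git revert` that produces a fresh change for review. A minimal local sketch, using a throwaway repo and a made-up filename in place of the real wmf15 branch:]

```shell
# Toy illustration of what Gerrit's "revert" button does: it creates a new
# commit that undoes an earlier one, which is then deployed like any change.
# Throwaway repo; the filename and messages are invented for the example.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=x@y -c user.name=x commit -q --allow-empty -m "base"
echo "new banner code" > CentralNotice.php
git add CentralNotice.php
git -c user.email=x@y -c user.name=x commit -q -m "Update CentralNotice"
# The revert button effectively runs:
git -c user.email=x@y -c user.name=x revert --no-edit HEAD
test ! -e CentralNotice.php && echo "revert restored previous state"
```

The resulting "Revert ..." commit is what then shows up on tin for syncing, buried among any local (e.g. security) patches already applied there.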
[18:30:20] OK [18:30:21] AndyRussG, they're already there [18:30:22] Thanks [18:30:31] right [18:30:33] yes [18:30:38] yes, it's a "bits" frontend cache, which specifically handles css and such [18:30:39] I assume a team of server ninjas has been dispatched? [18:30:39] bblack: ty [18:30:42] you just never touch anything you're not actually deploying [18:31:08] Qcoder00: in some form or another, yes, known and being worked on (it should be fixed for you "now") [18:31:16] It is [18:31:22] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:42] we had another report on -tech about 500 from bits. user in Europe as well. but confirmed fixed since you disabled it in pybal [18:31:52] greg-g, speaking of first deployments, I would like to teach phuedx to deploy today while he's still here. is a no-op push ok later today? [18:31:52] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 88.15 ms [18:31:52] 3operations, ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10Cmjohnson) Enabled network port ge-5/0/1 in row B added private1 vlan. Chris [18:32:11] yeah I went to check on that machine when it showed up funny in icinga, and it was having all sorts of crazy issues with hung processes on disk i/o. it may have a hardware failure [18:32:19] we'll see [18:32:32] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.179 second response time [18:32:51] bblack: Thanks [18:32:59] heh it booted off its local disks seemingly-ok, but then also: [18:33:00] ata_id[584]: HDIO_GET_IDENTITY failed for '/dev/sda': Invalid argument [18:33:03] ata_id[585]: HDIO_GET_IDENTITY failed for '/dev/sdb': Invalid argument [18:33:06] :) [18:33:16] You have backups? XD [18:33:39] we don't need them, these are stateless machines [18:33:47] but the missing capacity is not ideal [18:33:57] I'm surprised to see an error like this... [18:34:10] I thought you had failover code?
[18:34:21] anyway fixed for now [18:34:23] Thanks [18:34:25] :) [18:34:27] * Qcoder00 out [18:34:31] we do, but some failures are tricky when they're strange and partial like this [18:34:37] the service was running, sort-of :) [18:34:39] !log andyrussg Synchronized php-1.25wmf15/extensions/CentralNotice: Revert update to CentralNotice (duration: 00m 06s) [18:34:43] Logged the message, Master [18:35:01] MaxSem: ^ I think that's it :) [18:35:17] the health checks from the front LBs probably don't check very deeply for correctness, just responsiveness [18:35:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [18:36:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [18:36:54] !log removing cp1063 from pybal [18:36:59] Logged the message, Master [18:37:55] Hey opsen, do we have a dashboard or something somewhere where we measure the rate of 503s from bits? [18:39:38] MaxSem: all clear! fixed the issue, thank you for your help!!!! :D [18:39:43] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0 [18:39:47] :P [18:40:02] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [18:40:02] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [18:40:49] !log initial rsync from tungsten to graphite1001 T85909 [18:40:54] Logged the message, Master [18:42:51] RoanKattouw: well there's https://gdash.wikimedia.org/dashboards/reqerror/ [18:42:57] but that doesn't break out bits [18:45:10] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1004684 (10Tnegrin) Hi Gabriel -- do you have a target date for this? 2/15 is Sunday; can we discuss a better day?
thanks, -Toby [18:46:02] (03CR) 10Ottomata: [C: 032 V: 032] Temporarily keep 40 instead of 31 days of webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/187713 (owner: 10QChris) [18:46:12] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [18:47:26] MaxSem: (sorry for delay, in a meeting) yep! [18:49:16] 3ops-codfw, operations: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1004689 (10Papaul) Racked the 30 first mw servers [18:50:22] 3ops-codfw, operations: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1004701 (10Papaul) Racked 20 wtp servers [18:50:58] nice @ racked servers [18:51:26] !log cp1037,cp1038 - shut down [18:51:33] Logged the message, Master [18:52:41] RECOVERY - DPKG on cp3020 is OK: All packages OK [18:53:42] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:41] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:08] !log rebooting cp3020 again (still depooled) [18:56:12] Logged the message, Master [18:57:32] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:42] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:58:51] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.18 ms [18:59:52] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:01:11] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:02:05] hey _joe_!, qq for you if you are around [19:02:50] !log cp1039, cp1040 - shut down [19:02:58] Logged the message, Master [19:03:53] ^d: can i be added to the deployers group in gerrit? 
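[editor's note: bblack's remark above — that the front LB health checks "don't check very deeply for correctness, just responsiveness" — explains how a sort-of-alive cache like cp3020 can stay pooled. A minimal responsiveness-only probe can be sketched with curl; the URL below is a placeholder, not the actual pybal check:]

```shell
# Responsiveness-only health check, as contrasted above with a correctness
# check: -f fails on HTTP >= 400, but a 200 with a broken or empty body
# still passes, so a half-working backend would not be depooled by this alone.
url="http://127.0.0.1:8080/check"   # placeholder backend URL
if curl -sf -o /dev/null --max-time 2 "$url"; then
  echo "pooled (responsive)"
else
  echo "depooled (no/err response)"
fi
# A deeper check would also validate the body, e.g.:
#   curl -s "$url" | grep -q 'expected content'
```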
[19:04:03] currently trying to self +2 a cherry-pick [19:04:06] (and can't) [19:13:56] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004747 (10GWicke) @robh, any news? We are aiming for a release before mid-February. VE performance work (top priority project) depends on RESTBase being available ASAP, so moving fast on this would... [19:15:16] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1004751 (10GWicke) @tnegrin, we are mostly at the mercy of ops on this (T76986 and T78194). The hardware should be arriving around now, just pinged @RobH about the current status on T76986. [19:15:47] (03CR) 10Dzahn: [C: 032] decom cp1037,cp1038,cp1039,cp1040 [dns] - 10https://gerrit.wikimedia.org/r/187626 (https://phabricator.wikimedia.org/T87800) (owner: 10Dzahn) [19:16:00] AndyRussG: Did you find someone to help you? [19:17:12] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004754 (10RobH) At the time of order, it was a 2-3 week lead time for shipment. As that has passed and I have no further update, I've pinged our HP VAR via email (just now.) I'll update ticket with... [19:19:04] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004772 (10Dzahn) disabled puppet and salt-minions revoked puppet certs revoked salt keys delete from puppet stored configs (removed from icinga) shut them down ... removed from DNS .. [19:19:07] 3RESTBase, Scrum-of-Scrums, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1004773 (10GWicke) @RobH, thanks! [19:21:30] 3operations: Put archiva.wikimedia.org behind misc-web-lb and force https - https://phabricator.wikimedia.org/T88139#1004787 (10Ottomata) 3NEW a:3Ottomata [19:21:47] marktraceur: yep! all set :) thanks much! 
[19:22:07] it was just a deploy to mw.org and test wikis, and just a CentralNotice banner issue [19:22:47] * AndyRussG is glad for his colleagues' having urged prudence! [19:24:14] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004802 (10Dzahn) can we use the same ticket to finish the workflow nowadays ? (instead of creating linked tickets). now it's not separate queues anymore but just adding the relevant tags. i suggest to just "move" this ove... [19:27:08] 3Phabricator, obsolete, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004819 (10chasemp) [19:27:25] !log phuedx Synchronized php-1.25wmf15/extensions/MobileFrontend/: No-op deployment training (duration: 00m 06s) [19:27:32] Logged the message, Master [19:28:37] 3Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1004829 (10GWicke) @RobH, @Mark: Any news on this? [19:29:29] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1004845 (10GWicke) [19:32:13] 3ops-eqiad, operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004864 (10Dzahn) [19:32:35] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004871 (10chasemp) [19:36:04] 3ops-eqiad, operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1004885 (10Dzahn) the above servers have been removed from puppet, DNS and all that , see above. please continue with the decom/reclaim workflow with the things local to ops-eqiad, like wipe disk, derack physic...
[19:47:05] (03PS1) 10Legoktm: Update objectcache logging settings for I8a8e278e6f028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187730 [19:50:42] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004937 (10chasemp) P245 This has been completed I believe. [19:51:01] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#1004938 (10chasemp) 5Open>3Resolved [19:54:18] (03PS1) 10Rush: ops-core is no longer an active project [puppet] - 10https://gerrit.wikimedia.org/r/187734 [19:56:35] (03CR) 10Rush: [C: 032] ops-core is no longer an active project [puppet] - 10https://gerrit.wikimedia.org/r/187734 (owner: 10Rush) [19:57:12] chasemp: :) [19:58:14] (03PS1) 10BryanDavis: Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 [20:07:02] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [20:28:02] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:38:56] 3Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1005000 (10Krinkle) [20:52:42] fyi: I'm taking the afternoon afk (dr's appt etc). I'm emailable but if someone pings re an emergency deploy, I'm not liable to respond quickly [20:53:53] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1005031 (10GWicke) @akosiaris, trebuchet (salt really) is not able to do rolling restarts reliably, so we are using dsh to actually apply the restart. [20:54:21] greg-g, ok. [20:55:39] 3Ops-Access-Requests: Add James F. 
to pager duty - https://phabricator.wikimedia.org/T88153#1005037 (10Jdforrester-WMF) 3NEW [20:58:11] !log reinstalling cp3020 (seems to have fs corruption issues, but may not be hardware...) [20:58:17] Logged the message, Master [20:58:23] (03CR) 10Chad: [C: 031] fix check_elasticsearch CRITICAL output [puppet] - 10https://gerrit.wikimedia.org/r/186418 (owner: 10Filippo Giunchedi) [20:58:49] (03CR) 10Chad: [C: 032] Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 (owner: 10BryanDavis) [21:00:22] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:25] (03PS1) 10BBlack: cp3020 -> precise for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/187802 [21:01:00] (03CR) 10BBlack: [C: 032 V: 032] cp3020 -> precise for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/187802 (owner: 10BBlack) [21:01:55] (03CR) 10Aude: [C: 031] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [21:02:18] (03Merged) 10jenkins-bot: Show logging config in noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187736 (owner: 10BryanDavis) [21:02:31] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.15 ms [21:02:54] !log demon Synchronized docroot and w: (no message) (duration: 00m 10s) [21:02:58] Logged the message, Master [21:03:25] <^d> Hmmm [21:03:30] <^d> I see logging-labs.php [21:03:32] <^d> Not logging.php [21:04:18] !log demon Synchronized docroot/noc/conf/logging.php.txt: (no message) (duration: 00m 06s) [21:04:22] Logged the message, Master [21:04:33] <^d> There we go [21:04:39] <^d> sync-docroot skips symlinks? 
[21:05:02] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:11] <^d> bd808|BUFFER: http://noc.wikimedia.org/conf/highlight.php?file=logging.php [21:07:12] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 88.99 ms [21:07:38] (03CR) 10Chad: [C: 032] beta: Change ProfilerSimpleText to ProfilerXhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186604 (owner: 10BryanDavis) [21:08:36] docroot/noc ? [21:09:15] <^d> Yeah. [21:09:32] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: Connection refused [21:09:42] PROBLEM - salt-minion processes on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - configured eth on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - Disk space on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - dhclient process on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - RAID on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - puppet last run on cp3020 is CRITICAL: Connection refused by host [21:09:42] PROBLEM - DPKG on cp3020 is CRITICAL: Connection refused by host [21:09:51] PROBLEM - HTTPS on cp3020 is CRITICAL: Return code of 255 is out of bounds [21:11:56] (03CR) 10Hashar: "Filled as T88093 by Krenair" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186319 (owner: 10Jdlrobson) [21:12:25] ^d: thanks! [21:16:23] <^d> bd808: yw [21:16:48] oh, wow... [21:17:06] 3 reviewers didn't catch that? [21:18:37] Krenair: yeah. 
sadly easy mistake to make in our config soup [21:23:22] RECOVERY - dhclient process on cp3020 is OK: PROCS OK: 0 processes with command name dhclient [21:23:22] RECOVERY - Disk space on cp3020 is OK: DISK OK [21:23:22] RECOVERY - configured eth on cp3020 is OK: NRPE: Unable to read output [21:23:22] RECOVERY - salt-minion processes on cp3020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:23:22] RECOVERY - DPKG on cp3020 is OK: All packages OK [21:23:23] RECOVERY - RAID on cp3020 is OK: OK: no disks configured for RAID [21:23:54] (03Merged) 10jenkins-bot: beta: Change ProfilerSimpleText to ProfilerXhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186604 (owner: 10BryanDavis) [21:25:22] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 4 failures [21:27:21] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.177 second response time [21:29:32] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:42] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [21:35:52] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:51] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 93.33 ms [21:37:02] RECOVERY - HTTPS on cp3020 is OK: SSLXNN OK - 36 OK [21:38:39] !log killed diamond taking up 100% on labstore1001 [21:38:41] andrewbogott: ^ [21:38:44] Logged the message, Master [21:39:00] YuviPanda|flight: I killed it for a bit yesterday, it didn’t seem to help much. [21:39:07] Although I’m still curious why it was so CPU hungry [21:39:19] YuviPanda|flight: won’t puppet restart it shortly? [21:39:33] andrewbogott: yeah. 
an strace didn’t help much [21:40:06] (03PS3) 10Dzahn: add IPv6 interface to dataset1001 (eth2) [puppet] - 10https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) [21:40:49] it was more like rpcd was taking up the CPU on labstore [21:43:41] svc: transport ffff8800952b7000 busy, not enqueued [21:43:46] probably related [21:45:28] Brb helping out a lost person [21:52:35] !log re-pooling cp3020 (bits cache esams) - reinstalled, looks sane... [21:52:42] Logged the message, Master [21:54:55] (03PS1) 10Phuedx: Configure JS console recruitment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) [22:00:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [22:02:42] (03CR) 10BryanDavis: [C: 031] Update objectcache logging settings for I8a8e278e6f028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187730 (owner: 10Legoktm) [22:12:22] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1005216 (10Chmarkine) [22:20:31] (03PS1) 10BBlack: cp1047 -> out for hardware T88045 [puppet] - 10https://gerrit.wikimedia.org/r/187821 [22:20:59] (03CR) 10BBlack: [C: 032 V: 032] cp1047 -> out for hardware T88045 [puppet] - 10https://gerrit.wikimedia.org/r/187821 (owner: 10BBlack) [22:22:37] while looking around at pybal cleanup stuff: anyone know the status of mw1018 + mw1118 being disabled for eqiad/apaches + eqiad/api, respectively? [22:22:46] they're local changes for pybal, but currently uncommitted [22:25:51] nevermind, I found those in the SAL, going to commit them with the SAL msgs [22:28:38] (03PS1) 10Jdlrobson: Add wikidata to central auth config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 [22:28:40] (03PS1) 10Jdlrobson: Enable JS console recruitment on mobile. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) [22:29:04] (03PS2) 10Jdlrobson: Enable JS console recruitment on mobile. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) [22:29:47] (03Abandoned) 10Jdlrobson: Add wikidata to central auth config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 (owner: 10Jdlrobson) [22:30:16] (03CR) 10Jdlrobson: [C: 04-1] "Think this is in wrong place. See https://gerrit.wikimedia.org/r/187823" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) (owner: 10Phuedx) [22:31:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:31:01] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:33:52] ^ oops that was me [22:34:11] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:34:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [22:34:48] (03CR) 10Aude: "if it gets unabandoned..." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187822 (owner: 10Jdlrobson) [22:38:40] !log deployed parsoid version 2abd0eb6 [22:38:45] Logged the message, Master [22:56:33] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 11 failures [23:02:12] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [23:05:12] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005375 (10yuvipanda) [23:14:12] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:22:02] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:25:52] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1005459 (10mobrovac) p:5Triage>3High [23:36:04] 3RESTBase, operations: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1005556 (10mobrovac) p:5Triage>3High [23:36:31] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005558 (10yuvipanda) 3NEW [23:39:44] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005571 (10Merl) [23:39:47] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005570 (10Merl) [23:41:16] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005582 (10yuvipanda) Hmm, there's no 1:1 correspondence between the slices that are lagging and the ones that are reporting deadlock errors (s2 in particular). There was a huge spike in network traffic / CPU usage on db1069 sinc... [23:43:45] (03CR) 10Phuedx: "I had no idea mobile.php was a thing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) (owner: 10Jdlrobson) [23:49:58] 3RESTBase, operations, Scrum-of-Scrums, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1005604 (10mobrovac) [23:53:31] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005606 (10yuvipanda) replag is still growing, but my battery is running out, the airport people are looking at me suspiciously, and my brain isn't at its best after a 26h flight. Hopefully @springle can take a look soon, if not I...