[00:00:05] twentyafterfour: #bothumor I ❤ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:00:32] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 22.09, 22.96, 23.99 [00:04:00] Can someone disable https://phabricator.wikimedia.org/H285 please? It seems to be mistakenly configured to use an OR instead of an AND and thus adds #product-analytics to *any* task that has *any* activity. [00:04:02] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix path to hi.wikimedia.org 1x logo ([[Gerrit:427567]]) [00:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:32] Jayprakash12345: okay, I synced your namespace change and I'm done [00:05:00] Ah, I see greg just edited that. [00:07:03] (PS1) Dereckson: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) [00:07:09] Jayprakash12345: you've X-Wikimedia-Debug installed, haven't you? [00:07:27] yeah I have [00:07:34] (CR) Dereckson: [C: +2] Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:08:23] Just ping me to check the patch at mwdebug1002 [00:08:35] okay, waiting for zuul now [00:08:57] (Merged) jenkins-bot: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:09:11] Jayprakash12345: done [00:09:55] Looks good, the namespace is now in Hindi [00:10:04] Puppet, Beta-Cluster-Infrastructure: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4141484 (EddieGP) [00:10:09] Go ahead [00:11:04] (CR) jenkins-bot: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:12:11] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set project namespace for hi.wikimedia (T188366) (duration: 01m 16s) [00:12:17] !log Wikis creation done [00:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:18] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [00:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:00] jouncebot: now [00:13:00] For the next 0 hour(s) and 46 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T0000) [00:14:43] Dereckson: Thanks for being here [00:14:48] You're welcome.
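The patch-check workflow above (stage a wmf-config change, verify it on mwdebug1002, then sync everywhere) can be exercised from the command line. A minimal sketch, assuming the X-Wikimedia-Debug header syntax documented on wikitech and using the siteinfo API to inspect the project namespace that was just configured; the jq filter is illustrative:

```bash
# Route the request through the debug backend so the staged change on
# mwdebug1002 is exercised instead of the regular app server pool.
# Assumption: 'backend=...' is the header format; check the wikitech docs.
curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
  'https://hi.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json' \
  | jq -r '.query.namespaces."4"."*"'   # namespace id 4 = NS_PROJECT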
[00:15:32] (CR) Krinkle: [C: +1] Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: Jforrester) [00:17:49] Puppet, Beta-Cluster-Infrastructure: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141502 (EddieGP) [00:21:21] (PS4) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:21:54] (CR) jerkins-bot: [V: -1] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [00:23:02] (CR) EddieGP: "https://phabricator.wikimedia.org/T192473#4141453" [mediawiki-config] - https://gerrit.wikimedia.org/r/427281 (owner: Aaron Schulz) [00:23:50] (PS5) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:24:29] (CR) jerkins-bot: [V: -1] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [00:38:54] (PS6) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:54:12] (PS7) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:57:34] (CR) Dzahn: [C: +2] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [01:05:13] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/org/wikimedia/releases/parsoid] [01:06:12] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/org/wikimedia/releases/parsoid] [01:25:42] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [01:28:42] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:47:37] Operations, Mail, Surveys: Qualtrics cannot send email to wikimedia.org addresses - https://phabricator.wikimedia.org/T176666#4141649 (Neil_P._Quinn_WMF) [01:57:54] (Abandoned) Krinkle: Just run updateArticleCount.php over all.dblist [puppet] - https://gerrit.wikimedia.org/r/363639 (owner: Reedy) [02:36:44] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 05m 52s) [02:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:46] Dereckson: hi!
no, thankfully not an emergency
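The "Check systemd state" flaps on deploy1001 above track systemd's aggregate unit state. A sketch of what such a probe reduces to, assuming the Icinga plugin simply wraps standard systemctl calls (the real plugin may differ):

```bash
#!/bin/bash
# "running" means all units are healthy; "degraded" means at least one failed.
state=$(systemctl is-system-running)
if [ "$state" != "running" ]; then
    echo "CRITICAL - ${state}: The system is operational but one or more units failed."
    systemctl --failed --no-legend   # list the failed units for the operator
    exit 2
fi
echo "OK - running: The system is fully operational"
```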
[03:18:44] !log decommissioning Cassandra, restbase1010-c -- T189822 [03:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:50] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [04:42:21] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141728 (aaron) The warnings are pointless, the patch above adds an isset() check. [05:21:16] (PS1) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) [05:23:43] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:24:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:26:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 for alter table (duration: 01m 33s) [05:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:23] !log Deploy schema change on db1097:3315 - T191519 T188299 T190148 [05:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:30] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:27:30] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:27:31] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:29:11] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:31:28] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141772 (Marostegui) So this is almost confirmed to be related to atop. I killed it yesterday at around 14:30 and it remained stopped till 00:00 (when it started automatic... [05:33:24] !log Revert RX buffer changes on db1114 - T191996 [05:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:30] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [05:35:33] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141775 (Marostegui) RX buffers reverted ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current...
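The RX buffer revert on db1114 uses ethtool's ring-parameter interface; the truncated paste above shows the pre-set maximums. A sketch, assuming eno1 as in the log; the value being restored is illustrative, since the "Current" section of the paste is cut off:

```bash
ethtool -g eno1        # show pre-set maximums and current ring sizes
# Restore the RX ring to its previous size (illustrative value; the actual
# number is elided above). On many NICs, -G briefly resets the link.
ethtool -G eno1 rx 255
```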
[05:36:08] !log Kill atop on db1114 - T191996 [05:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:30] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#4141809 (Joe) [05:56:34] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141813 (Joe) [05:58:05] Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#4141816 (Joe) [05:58:11] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149754 (Joe) [05:59:58] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141821 (Joe) We've had 3 mdadm checkarray full runs since we merged the change in February, and no alert went off in the meantime. I would be inclined to consider this... [06:00:15] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141822 (Joe) Open→Resolved [06:05:36] Puppet, Analytics-Kanban, Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (Joe) I would strongly suggest that any system that wants to archive geoip data from maxmind should create its own repository of data and NOT... [06:08:01] Puppet, Analytics-Kanban, Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4141827 (Joe) >>! In T136732#4139610, @Ottomata wrote: > We could do that, but we wanted something centralized and reproducible (e.g. include a puppe... [06:12:08] (PS1) Giuseppe Lavagetto: Revert "releases: add directory for parsoid archive" [puppet] - https://gerrit.wikimedia.org/r/427594 [06:12:25] (CR) Giuseppe Lavagetto: [C: +2] "releases1001 puppet-agent[10365]: Could not find user releasers-parsoid" [puppet] - https://gerrit.wikimedia.org/r/427594 (owner: Giuseppe Lavagetto) [06:16:12] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:21:41] (PS1) Giuseppe Lavagetto: contint::packages::php: fix lua package name [puppet] - https://gerrit.wikimedia.org/r/427595 [06:23:05] (CR) Giuseppe Lavagetto: [C: +2] contint::packages::php: fix lua package name [puppet] - https://gerrit.wikimedia.org/r/427595 (owner: Giuseppe Lavagetto) [06:28:43] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:31:52] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/sbin/enforce-users-groups] [06:34:36] (CR) Gilles: [C: +1] Simplify threedtopng::deploy after image scaler removal [puppet] - https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff) [06:35:13] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:37:57] (PS1) Elukey: role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 [06:39:15] (PS2) Elukey: role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 [06:39:50] (CR) Elukey: [C: +2] role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 (owner: Elukey) [06:49:32] PROBLEM - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.33 and port 9042: Connection refused [06:49:42] PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [06:49:52] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:50:10] hello restbase1016 [06:50:12] PROBLEM - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [06:54:26] can't find much in the cassandra logs [06:56:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] I'd ping the services team before attempting any restarts, so they can take a look at what happened [06:57:15] it shouldn't be a problem if one instance is down [06:57:19] mobrovac: --^ [06:57:44] or ^ godog [06:57:59] yep yep [06:58:43] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:01:43] RECOVERY - cassandra-b service on restbase1016 is OK: OK - cassandra-b is active [07:01:53] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [07:01:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 [07:03:23] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:04:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:06:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 after alter table (duration: 01m 17s) [07:06:13] RECOVERY - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-b valid until 2018-08-17 16:11:27 +0000 (expires in 120 days) [07:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:32] RECOVERY - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is OK: TCP OK - 0.001 second response time on 10.64.0.33 port 9042 [07:06:58] puppet brought it up again :) [07:09:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:16:51] Operations: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4141839 (akosiaris) stalled→Open >1 month with no incident. I'll proceed with rebooting all ganeti VMs on row_C and then move on to codfw
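The ganeti reboots announced here roll out the cache=none KVM disk setting from T181121. A sketch of how that looks with the standard Ganeti CLI, assuming the setting is applied as a cluster-level hypervisor default; the instance name is hypothetical:

```bash
# Make cache=none the cluster-wide KVM default, bypassing the host page
# cache implicated in the kernel I/O errors on ganeti1005-ganeti1008.
gnt-cluster modify -H kvm:disk_cache=none
# Each VM needs a reboot to pick up the new hypervisor parameter.
gnt-instance reboot examplevm.eqiad.wmnet   # hypothetical instance name
```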
[07:16:54] (PS1) Vgutierrez: install_server: Reimage lvs4006 as stretch [puppet] - https://gerrit.wikimedia.org/r/427598 (https://phabricator.wikimedia.org/T191897) [07:24:16] !log reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121 [07:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:22] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [07:26:59] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs4006 as stretch [puppet] - https://gerrit.wikimedia.org/r/427598 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [07:27:52] (CR) Filippo Giunchedi: [C: +1] Simplify threedtopng::deploy after image scaler removal [puppet] - https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff) [07:28:06] elukey: ack, yeah I think we saw that before :( [07:30:28] :( [07:32:23] !log Depool and reimage lvs4006 - T191897 [07:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:29] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [07:34:30] (PS1) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:37:34] (PS1) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [07:43:33] (CR) Filippo Giunchedi: [C: +2] "> Patch Set 7: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/427378 (owner: Filippo Giunchedi) [07:43:38] (PS8) Filippo Giunchedi: tox: run nagios_common tests [puppet] - https://gerrit.wikimedia.org/r/427378 [07:47:30] !log set cache=none for ganeti VMs in codfw cluster configuration.
VM reboots to follow T181121 [07:47:32] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:37] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [07:47:42] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:02] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:02] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary 
definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:18] 404? [07:48:20] (PS2) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:48:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [07:48:40] wut? [07:48:42] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [07:48:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [07:48:46] a recovery already? what on earth happened? [07:48:52] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [07:48:52] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [07:48:53] (CR) jerkins-bot: [V: -1] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn) [07:49:02] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [07:49:02] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [07:49:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:49:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [07:49:15] akosiaris: maybe somebody changed the cat page? [07:49:47] https://en.wiktionary.org/w/index.php?title=cat&action=history [07:49:47] article hasn't changed since 4th of Feb so unrelated to that [07:49:50] <_joe_> that looks like a real issue on mobileapps [07:49:59] <_joe_> no way it's a problem with wikipedia [07:50:15] wait, Cat vs cat [07:50:17] <_joe_> or wiktionary [07:50:47] I am not so sure about that [07:51:04] the last change is a couple of mins ago [07:51:18] yes, and someone decided to inline the image base base64 [07:51:25] <_joe_> ok yeah I think it's vandalism [07:51:28] image as base64* [07:51:42] <_joe_> cat, not CAT [07:51:58] yes I looked for "cat" [07:52:56] (PS3) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:53:22] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4141874 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4006.ulsfo.wmnet ``` The log can be found in `/var/lo...
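The flapping check above exercises the mobileapps definition endpoint for en-wiktionary's "cat" entry; when a vandalised revision wipes the definitions, the service legitimately answers 404, which is why the alert cleared as soon as the bot reverted. A sketch of reproducing the probe by hand, assuming the public REST path mirrors the internal route the checker hits:

```bash
# 200 while definitions exist in the latest revision, 404 once they are gone.
curl -s -o /dev/null -w '%{http_code}\n' \
  'https://en.wiktionary.org/api/rest_v1/page/definition/cat'
```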
[07:53:27] (CR) jerkins-bot: [V: -1] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn) [07:54:21] so it's a small edit war right now [07:54:34] <_joe_> akosiaris: you mean a vandal [07:54:52] yeah, vandals wage edit wars by definition [07:54:57] (CR) Elukey: [C: +1] Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff) [07:54:58] but yes it's vandalism [07:56:02] and the bot just reverting the damage [07:56:07] but the 404 was a tad weird [07:58:27] <_joe_> the 404 is mobileapps telling the client that there was no summary to be fetched for the page [07:58:31] <_joe_> so kinda expected? [07:58:49] <_joe_> if the vandalism wiped out the definition [08:05:10] Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4141920 (ema) [08:05:14] Operations, ops-eqiad, DC-Ops, hardware-requests, Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4141918 (ema) Resolved→Open Re-opening, this morning we had two icinga criticals for lawrencium and lawrencium.mgmt being down. Some de... [08:08:13] Operations, ops-eqiad, DC-Ops, hardware-requests, Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4102460 (MoritzMuehlenhoff) There are still DNS entries in git: jmm@korn:~/git/dns$ rgrep lawrenc * templates/10.in-addr.arpa:94 1H IN P... [08:11:47] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141946 (EddieGP) [08:14:07] !log upgrading app server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854) [08:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:14] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [08:14:22] !log reboot deploy1001 and arm keyholder T175288 [08:14:23] _joe_: yeah that's exactly what happened [08:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:28] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [08:14:30] vandal removed the definitions [08:14:39] (CR) DCausse: Add cirrussearch settings for wikibase (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: DCausse) [08:14:41] (PS15) DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [08:14:51] RECOVERY - nutcracker port on deploy1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:15:01] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [08:15:41] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker [08:15:53] (PS1) Gilles: Upgrade to 2.0 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/427612 (https://phabricator.wikimedia.org/T27611) [08:20:18] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as
stretch - https://phabricator.wikimedia.org/T191897#4141972 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4006.ulsfo.wmnet'] ``` and were **ALL** successful. [08:20:22] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141975 (EddieGP) [08:21:52] (PS1) Filippo Giunchedi: rake: ignore rubocop Style/NumericPredicate in taskgen [puppet] - https://gerrit.wikimedia.org/r/427614 [08:23:13] (CR) Filippo Giunchedi: [C: +2] rake: ignore rubocop Style/NumericPredicate in taskgen [puppet] - https://gerrit.wikimedia.org/r/427614 (owner: Filippo Giunchedi) [08:23:44] apergos: ^ try rebasing, should be working now [08:23:52] great thank you [08:24:18] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141990 (EddieGP) p: Unbreak!→Low Per aaron's comment, just logspam. Seems the actual problem for renames was nutcracker, which I fixed in T192473#4141... [08:24:47] (PS4) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:26:44] yep we're back in business [08:29:50] (PS1) Vgutierrez: pybal: Re-enable bgp in lvs4006 [puppet] - https://gerrit.wikimedia.org/r/427615 (https://phabricator.wikimedia.org/T191897) [08:32:47] (PS5) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:35:40] (CR) Vgutierrez: [C: +2] pybal: Re-enable bgp in lvs4006 [puppet] - https://gerrit.wikimedia.org/r/427615 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [08:40:07] !log Repool (Re-enable BGP) lvs4006 - T191897 [08:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:12] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [08:43:58] (PS6) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:45:16] (PS3) Ema: role::kafka::analytics: get rid of ipsec [puppet] - https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) [08:46:07] (PS2) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [08:46:44] (CR) Ema: [C: +2] role::kafka::analytics: get rid of ipsec [puppet] - https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) (owner: Ema) [08:46:57] (PS3) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [08:47:39] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142017 (Vgutierrez) [08:48:34] (CR) Muehlenhoff: [C: +2] Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff) [08:49:20] (PS2) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) [08:49:33] (PS2) Muehlenhoff: Also handle Prometheus exporters
in app server decom script [puppet] - https://gerrit.wikimedia.org/r/427340 [08:49:59] (CR) jerkins-bot: [V: -1] alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) (owner: Filippo Giunchedi) [08:50:01] (CR) Muehlenhoff: [C: +2] Also handle Prometheus exporters in app server decom script [puppet] - https://gerrit.wikimedia.org/r/427340 (owner: Muehlenhoff) [08:50:03] (PS7) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:50:13] some days I am not meant to write even the simplest amount of puppet code. clearly today is one of those days [08:52:33] (PS1) Filippo Giunchedi: rubocop: display cop names [puppet] - https://gerrit.wikimedia.org/r/427619 [08:53:36] apergos: well.. on Tuesday I hit almost every use case of the jenkins commit validator.. I got to PS4 to get the commit message right /o\ [08:54:22] (PS3) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) [08:54:39] I think I've only pissed off the commit message checker once [08:54:49] now I've jinxed it of course [09:00:25] (PS1) Vgutierrez: install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) [09:01:19] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [09:01:50] (PS2) Vgutierrez: install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) [09:02:52] PROBLEM - DPKG on mw1276 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:02:52] PROBLEM - DPKG on mw1279 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:03:20] !log upgrading API server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854) [09:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:28] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [09:03:31] ^ that's me, forgot to silence in Icinga [09:03:34] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4142093 (MarcoAurelio) 👍 [09:03:42] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:52] RECOVERY - DPKG on mw1276 is OK: All packages OK [09:03:52] RECOVERY - DPKG on mw1279 is OK: All packages OK [09:04:32] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 77206 bytes in 0.155 second response time [09:06:32] !log Depool and reimage lvs4005 as stretch - T191897 [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:38] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [09:10:59] Operations, monitoring, Patch-For-Review: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857#4142097 (Volans) Open→Resolved To summarize the work done recently, I've made an audit of existing checks and fixed/improv...
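The MEMC_VAL_COMPRESSION_ZLIB rollout above runs canaries first, then batches of app and API servers; the brief DPKG and HHVM rendering flaps are the restart window of exactly this kind of step. A per-host sketch, assuming the depool/pool conftool wrapper scripts present on WMF app servers and hhvm as the package name; the host list is illustrative:

```bash
# One host out of rotation at a time: drain, upgrade, restart, re-add.
for host in mw1221 mw1222; do   # illustrative host list
  ssh "${host}.eqiad.wmnet" '
    sudo depool &&                      # remove the host from the LVS pools
    sudo apt-get install -y hhvm &&     # pick up the rebuilt package
    sudo systemctl restart hhvm &&
    sudo pool                           # put the host back in rotation
  '
done
```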
[09:14:20] (PS2) Filippo Giunchedi: base: alert on edac uncorrectable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [09:14:28] !log installing Java security updates on maps* plus rolling restart of Cassandra to pick up new JRE [09:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:50] !log reboot puppetdb1001 for the cache=none setting to apply. T181121 [09:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:56] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [09:26:31] Operations, DC-Ops, Traffic, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4142112 (fgiunchedi) I've taken a first stab at reporting uncorrectable errors in https://gerrit.wikimedia.org/r/c/422110/ as reported by the kernel, so at least... [09:27:47] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:52] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:52] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:53] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:02] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:12] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on ping1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:53] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:28:53] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:54] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:54] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:55] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:55] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:56] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:56] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:57] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:12] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:33] !log stop ircecho for a while, puppetdb1001 reboot was eventful [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:01] !log start a force puppet run in all of eqiad with a batch size of 30 [09:31:04] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142131 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2002.codfw.wmnet']... [09:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:53] (PS2) Elukey: role::configcluster: upgrade zookeeper main-eqiad to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) [09:32:09] mobrovac: all right I am ready [09:32:16] let's chat in here [09:32:49] ok let's go elukey [09:33:32] !log upgrade zookeeper on conf100[123] from 3.4.5 to 3.4.9 - T182924 [09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:38] T182924: Refresh zookeeper nodes in eqiad - https://phabricator.wikimedia.org/T182924 [09:33:48] elukey: FYI ircecho is still stopped [09:34:33] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142137 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4005.ulsfo.wmnet ``` The log can be found in `/var/lo...
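The "force puppet run in all of eqiad with a batch size of 30" after the puppetdb1001 hiccup is the standard cumin recovery pattern for a catalog-failure storm. A sketch, assuming cumin's direct host-glob selection and the run-puppet-agent wrapper; the exact query used is not in the log:

```bash
# -b 30 limits concurrency to 30 hosts at a time, so the freshly rebooted
# puppetdb is not stampeded by every agent in the site at once.
sudo cumin -b 30 '*.eqiad.wmnet' 'run-puppet-agent'
```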
[09:35:35] volans: thanks for the reminder, will watch icinga [09:36:01] mobrovac: doing sanity checks [09:36:53] ok so added downtime, puppet disabled on conf100[123], verified that cluster is working [09:37:00] 1001/2 are followers, 1003 leader [09:37:22] (CR) Elukey: [C: +2] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) (owner: Elukey) [09:37:42] Operations, Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4142144 (fgiunchedi) >>! In T175361#4140768, @herron wrote: > Indeed, after upgrading to `3.0.0~rc5-1~bpo9+1` mtail starts up happily. > > @fgiunchedi do you think it would be safe to pin the mtail package to... [09:38:22] proceeding with conf1001 [09:39:27] k [09:41:21] 1001 upgraded, all good so far [09:42:05] elukey: let's wait 3,4 minutes before proceeding [09:42:18] yep yep [09:42:27] Operations, Dumps-Generation, Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4142148 (ArielGlenn) I'm planning to try disabling the nfs attribute cache for files and directories on one of the snapshots for t... [09:42:28] i want to make sure all is still good on our side [09:43:18] (CR) Gehel: [C: +1] [cirrus] Increase the number of shards for wikidatawiki_content, enwiki_general [mediawiki-config] - https://gerrit.wikimedia.org/r/427176 (https://phabricator.wikimedia.org/T192064) (owner: DCausse) [09:44:01] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4142149 (MarcoAurelio) jobqueue at beta is down again; see ok elukey, looking good, let's proceed [09:47:10] mobrovac: ack, kafka/burrow metrics look good [09:49:19] upgrading 1002 [09:49:22] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142156 (akosiaris) [09:49:38] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142168 (akosiaris) p: Triage→High [09:49:57] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142156 (akosiaris) [09:50:28] done [09:51:22] !log rolling restart of Cassandra on maps completed [09:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] ^ gehel [09:51:49] Operations, Puppet, puppet-compiler, Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4142170 (EddieGP) [09:52:12] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:52:41] RECOVERY - puppet last run on bast1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:52:42] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:53:51] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:01] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:54:11] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:54:42] RECOVERY
- puppet last run on ms1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:46] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142197 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2002.codfw.wmnet'] ``` and were **ALL** successful. [09:54:56] !log Updating puppet compiler facts [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:02] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:55:08] !log reboot ganeti VMs on row_B in codfw for cache=none setting. T181121 [09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [09:55:46] elukey: ok let's proceed with the leader now? [09:55:51] mobrovac: ack [09:55:59] didn't see any glitch in my metrics [09:56:02] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:56:11] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:56:32] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:56:32] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:56:41] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:56:41] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:56:51] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:57:02] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:57:02] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:53] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:54] RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:58:48] mobrovac: done, new leader is 1002 [09:58:53] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:59:04] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:59:30] !log complete migration of zookeeper on conf100[123] [09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:53] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:35] elukey: all seems good from my pov [10:02:11] mobrovac: from mine too, thanks for the support! [10:02:40] thnx for pushing this through elukey! [10:03:28] mobrovac: hope that you will not hate me by the end of next week when we'll swap conf100[123] with conf100[456] :D [10:04:26] elukey: maybe i should go on vacations then?
:D [10:04:39] ahahha [10:04:47] (PS3) Filippo Giunchedi: base: alert on edac (un)correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [10:07:21] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142238 (akosiaris) [10:15:37] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142256 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4005.ulsfo.wmnet'] ``` and were **ALL** successful. [10:23:04] (PS5) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [10:23:24] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs4005 [puppet] - https://gerrit.wikimedia.org/r/427629 (https://phabricator.wikimedia.org/T191897) [10:23:59] (CR) Vgutierrez: [C: +2] pybal: Re-enable BGP in lvs4005 [puppet] - https://gerrit.wikimedia.org/r/427629 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:27:32] !log Repool (Re-enable BGP) lvs4005 - T191897 [10:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:38] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [10:30:57] (PS1) Sbisson: Make tilerator_storage_id to kartotherian [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [10:34:20] !log upgrading API servers mw1221-mw1235 to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [10:34:24] (CR) Gehel: [C: -1] Make tilerator_storage_id to kartotherian (1 comment) [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: Sbisson) [10:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:49] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142329 (Vgutierrez) [10:39:47] (CR) Sbisson: [C: +1] "It's time to remove those."
[puppet] - https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) (owner: Gehel) [10:41:23] (PS2) Sbisson: Make tilerator_storage_id to kartotherian [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [10:41:31] (PS1) Vgutierrez: hieradata: clean-up ulsfo lvs configuration [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) [10:45:14] (CR) Vgutierrez: "pcc looks happy and shows the expected noop: https://puppet-compiler.wmflabs.org/compiler02/10978/" [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:48:51] (PS1) Alexandros Kosiaris: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 [10:49:10] (CR) Vgutierrez: [C: +2] hieradata: clean-up ulsfo lvs configuration [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:50:57] jouncebot: next [10:50:58] In 2 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1300) [10:54:02] (PS1) Jcrespo: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 [10:56:34] (PS1) Marostegui: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) [10:57:18] jynus: I will go after you [10:57:28] ok, then merging now [10:57:39] (CR) Jcrespo: [C: +2] mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:58:52] (Merged) jenkins-bot: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:59:10] (CR) jenkins-bot: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:59:15] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:00:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:01:33] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2076 (duration: 01m 18s) [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113:3315 for alter table (duration: 01m 16s) [11:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:27] !log starting reimage of db2076 [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] (CR) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:05:18] !log Deploy schema change on db1113:3315 - T191519 T188299 T190148 [11:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:26] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [11:05:26] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [11:05:26]
T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [11:07:35] (CR) Alexandros Kosiaris: [C: +2] Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 (owner: Alexandros Kosiaris) [11:09:46] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 17s) [11:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:01] (CR) jenkins-bot: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 (owner: Alexandros Kosiaris) [11:11:05] (PS1) Alexandros Kosiaris: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 [11:11:57] (CR) Alexandros Kosiaris: [C: +2] Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:12:44] (Merged) jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:14:29] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: T181121 (duration: 01m 16s) [11:14:30] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142413 (Marostegui) No more errors for the last 6 hours after killing atop. Also no drops or connection errors running the original RX buffers after reverting them as c... [11:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [11:16:03] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:04] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:23] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:24] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:24] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:31] (CR) jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:16:34] PROBLEM - puppet last run on mc2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:43] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:43] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:44] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:53] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:54] PROBLEM - puppet last run on ms-be2034 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:17:04] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:34] PROBLEM - puppet last run on mw2272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:43] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:44] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:13] PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:33] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142435 (10Gilles) How frequent were they as of late, before the change? [11:19:03] PROBLEM - puppet last run on mw2268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:03] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:03] PROBLEM - puppet last run on ganeti2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:09] hmm [11:19:13] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:13] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:13] PROBLEM - puppet last run on elastic2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:15] expected [11:19:23] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:20:49] !log upgrading app servers mw1238-mw1258 to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] !log Sanitize lfnwiki - T183566 [11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:36] T183566: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566 [11:24:13] !log Run check_private_data on labsdb - T183566 [11:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:34] PROBLEM - DPKG on mw1246 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:33] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:34] RECOVERY - DPKG on mw1246 is OK: All packages OK [11:26:03] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:26:23] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 77207 bytes in 0.096 second response time [11:27:08] ^ downtime expired, all fine [11:28:23] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg],Service[hhvm] [11:39:29] !log upgrading eqiad video scalers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [11:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:44:11] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [11:44:20] RECOVERY - puppet last run on elastic2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:44:20] RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:44:21] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:01] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:20] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:30] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:30] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:46:40] RECOVERY - puppet last run on mc2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4142485 (10MoritzMuehlenhoff) It's also still in puppet, BTW: jmm@sarin:~$ sudo cumin lawr* 1 hosts will be targeted: lawrencium.eqiad.wmnet DRY... 
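For readers who haven't met the tool Moritz quotes just above: cumin selects hosts by pattern and optionally runs a command on them; invoked with only a host expression it stays in dry-run mode and just reports what it would target, which is the truncated `DRY...` output in his paste. A minimal sketch (the `uptime` command is an arbitrary illustration):
```
sudo cumin 'lawr*'            # selection only: "1 hosts will be targeted: lawrencium.eqiad.wmnet", then dry-run abort
sudo cumin 'lawr*' 'uptime'   # actually run a command on every matched host
```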
[11:47:00] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on ms-be2034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:47:40] RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:41] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:47:41] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:48:10] RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:48:51] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:48:55] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 [11:49:01] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:49:10] RECOVERY - puppet last run on ganeti2007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:49:10] RECOVERY - puppet last run on mw2268 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:50:12] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:51:24] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:51:38] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:52:20] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [11:54:10] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [11:55:54] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2076 (duration: 01m 16s) [11:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:31] 10Operations: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4142506 (10akosiaris) 05Open>03Resolved All VMs have been migrated to using `cache=none`. I'll resolve this successfully, hopefully we will not meet this issue again [12:21:20] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4142521 (10akosiaris) 05stalled>03Open With `cache=none` being set in all clusters for unrelated reasons, this is now unblocked. In the meantime `jessie-backports` has upgraded to `2.8`. Fortunately the changelog[...
[12:21:23] (03PS3) 10Matthias Mullie: Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:21:47] (03CR) 10Matthias Mullie: [C: 031] "LGTM, but lacking permissions to +2" [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:26:42] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142530 (10fgiunchedi) Somewhat frequent but not a lot, I don't have exact numbers tho {F17126546} [12:29:35] (03PS3) 10Sbisson: Make tilerator_storage_id to kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [12:33:52] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142540 (10Gilles) Is the Apr 18 occurrence on that screenshot after the change was deployed? [12:39:48] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4142546 (10fgiunchedi) @mmodell for sure, I was reading the log and I wonder why architecture changed from all to any? [12:40:06] (03PS1) 10Jcrespo: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 [12:40:27] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (10BBlack) >>! In T191996#4139205, @Marostegui wrote: > For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: > ```... [12:40:48] gilles: I'll forward you the cron emails if that's ok? looks simpler to me [12:41:05] godog: sure [12:43:16] of course gmail doesn't allow forwarding conversations )o) [12:43:20] mass-forward that is [12:43:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:44:52] (03Merged) 10jenkins-bot: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:44:58] gilles: {{done}} [12:45:02] thanks [12:46:48] Date: Tue, Apr 17, 2018 at 11:31 AM - which TZ is that? [12:47:00] godog: ^ [12:47:30] ah, nevermind, there are later occurrences [12:48:27] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2075 (duration: 01m 16s) [12:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:37] gilles: looks like my browser's, so CET/CEST [12:48:53] (03CR) 10jenkins-bot: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:49:39] 10Operations, 10monitoring: Improve remote IPMI monitoring - https://phabricator.wikimedia.org/T192547#4142570 (10Volans) [12:49:52] 10Operations, 10monitoring: Improve remote IPMI monitoring - https://phabricator.wikimedia.org/T192547#4142580 (10Volans) p:05Triage>03Normal [12:54:54] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142590 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2003.codfw.wmnet']...
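The irq-balance claim BBlack quotes at 12:40 ("the irq for eno1 is balanced across CPUs") is checkable from any shell on the host; a sketch, with the IRQ number invented for illustration:
```
grep eno1 /proc/interrupts           # one counter column per CPU shows how the NIC's interrupts are spread
cat /proc/irq/60/smp_affinity_list   # which CPUs IRQ 60 (hypothetical number) is allowed to fire on
```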
[12:58:45] !log starting reimage of db2075 [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:02:53] I'm here [13:06:13] anybody to swat? [13:07:51] I can if needed [13:08:07] Reedy, well, nobody else's here apparently :) [13:08:39] (03PS2) 10Reedy: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:08:46] (03CR) 10Reedy: [C: 032] Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:08] (03Merged) 10jenkins-bot: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:22] (03CR) 10jenkins-bot: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:27] (03PS4) 10Reedy: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:10:39] (03CR) 10Reedy: [C: 032] Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:11:57] (03Merged) 10jenkins-bot: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:13:00] No namespace dupes either [13:13:25] Reedy, you mean...no need for the script? [13:13:29] Yup [13:13:32] 0 conflicts [13:13:46] that's good, isn't it :) [13:14:55] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: T192427 T189277 (duration: 01m 17s) [13:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:02] T192427: Enable $wgUseRCPatrol in hiwikiversity - https://phabricator.wikimedia.org/T192427 [13:15:02] T189277: Change aliases on ruwiki - https://phabricator.wikimedia.org/T189277 [13:15:19] thank you for the deploy Reedy ! [13:15:40] np [13:15:41] (03CR) 10jenkins-bot: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:16:07] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142631 (10Gilles) Looking at the content of the emails @fgiunchedi forwarded to me, the last offense is right when the next cron kicks in after the restart and keeps going f... [13:18:33] (03CR) 10Ottomata: "Yar, I'm thinking this role class based targeting is not the best way to do this. Pretty fragile and disconnected."
[puppet] - 10https://gerrit.wikimedia.org/r/427596 (owner: 10Elukey) [13:20:18] (03PS1) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [13:20:52] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142646 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2003.codfw.wmnet'] ``` and were **ALL** successful. [13:21:05] (03CR) 10jerkins-bot: [V: 04-1] base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:22:05] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4142647 (10MoritzMuehlenhoff) Or we could upgrade the Ganeti cluster to stretch? It provides qemu 2.8 out of the box. [13:23:10] (03CR) 10Filippo Giunchedi: "List of currently-affected hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:25:38] (03PS2) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [13:30:13] !log upgrading mw1334-mw1337 (job runners) to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [13:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] !log reindexing serbian wikis on elastic@eqiad (T189265) [13:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:51] T189265: Re-index Serbian Wikis - https://phabricator.wikimedia.org/T189265 [13:33:50] !log Start atop on db1114 - T191996 [13:33:54] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142669 (10Marostegui) >>! In T191996#4142547, @BBlack wrote: > > Not that it's probably the issue here, but this probably isn't ideal. If you look at `grep eno1 /proc/in... [13:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:56] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [13:35:32] (03PS1) 10Elukey: Release 0.10.1-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) [13:37:55] (03PS2) 10Elukey: Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) [13:39:40] !log Stop atop on db1114 - T191996 [13:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:46] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [13:40:33] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142677 (10Marostegui) 05Open>03Resolved a:03Marostegui So, as soon as I started atop, errors came back and packets dropped. So the culprit is clearly `atop`. I am go... 
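"The script" Urbanecm asks Reedy about after the ruwiki namespace-alias change (13:13) is presumably MediaWiki's namespaceDupes.php maintenance script, which finds pages stranded when namespaces or aliases change; with 0 conflicts there was nothing for it to do. A sketch of the usual invocation via the mwscript wrapper that appears later in this log:
```
mwscript namespaceDupes.php --wiki=ruwiki          # report conflicting titles only
mwscript namespaceDupes.php --wiki=ruwiki --fix    # move/repair the conflicts it finds
```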
[13:42:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 [13:43:13] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142681 (10Marostegui) [13:44:21] (03PS1) 10Jcrespo: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 [13:45:13] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:46:28] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:48:49] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:50:06] (03CR) 10Muehlenhoff: [C: 032] Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [13:50:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 (owner: 10Jcrespo) [13:50:37] (03PS2) 10Jcrespo: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 [13:53:03] (03PS2) 10Muehlenhoff: Remove rendering from lvs::configuration::lvs_service_ips for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/427356 [13:53:19] (03PS1) 10Gilles: Xenon: don’t generate SVGs for recently modified logs [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) [13:55:12] (03CR) 10Muehlenhoff: [C: 032] Remove rendering from lvs::configuration::lvs_service_ips for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/427356 (owner: 10Muehlenhoff) [13:56:48] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2075, depool db2074 (duration: 01m 16s) [13:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:44] (03CR) 10jenkins-bot: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 (owner: 10Jcrespo) [14:00:38] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (10Marostegui) [14:01:05] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142739 (10Marostegui) [14:01:09] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129002 (10Marostegui) [14:01:10] (03PS2) 10Gehel: maps: remove sources.yaml [puppet] - 10https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) [14:02:44] (03CR) 10Ottomata: [C: 031] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [14:03:54] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142741 (10jcrespo) If I have to guess, I would say it is the combination of the stretch version + high load (if it is network, cpu or io, I cannot say) - I think enwiki API are hosts with lots of ongoing...
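The "alert on SMART health failure" change godog iterates on above (13:20-13:25) presumably surfaces the drive's own health verdict through Icinga; what that check looks like by hand, with the device path illustrative:
```
sudo smartctl -H /dev/sda   # prints e.g. "SMART overall-health self-assessment test result: PASSED"
sudo smartctl -A /dev/sda   # full attribute table: reallocated sectors, pending sectors, etc.
```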
[14:04:01] (03CR) 10Gehel: [C: 032] maps: remove sources.yaml [puppet] - 10https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) (owner: 10Gehel) [14:04:18] (03CR) 10Muehlenhoff: [C: 032] Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: 10Hashar) [14:08:25] (03PS1) 10Dzahn: releases-parsoid: fix directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:09:03] (03PS2) 10Gehel: maps: cleanup of sources.yaml code [puppet] - 10https://gerrit.wikimedia.org/r/423722 (https://phabricator.wikimedia.org/T112948) [14:09:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 [14:11:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:12:04] !log starting reimage of db2074 [14:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:13:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:13:41] (03CR) 10Gehel: [C: 032] "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/10980/" [puppet] - 10https://gerrit.wikimedia.org/r/423722 (https://phabricator.wikimedia.org/T112948) (owner: 10Gehel) [14:14:34] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142754 (10BBlack) When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like: ``` root 26337 1 0 00:00 ? 00:00:04 /usr/bin/atop -a -... [14:14:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113:3315 after alter table (duration: 01m 16s) [14:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:57] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#4142755 (10MoritzMuehlenhoff) 05Open>03Resolved All traces of the image scalers are gone. There's some additional puppet refactoring to be done, but unrel... [14:15:59] (03PS2) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:16:20] (03PS1) 10Muehlenhoff: Use a WMF-specific version number, not one from Debian backports [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/427670 [14:16:43] (03PS3) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:17:08] (03CR) 10Dzahn: "yep, sorry about that and thanks for reverting. 
i merged and something happened in RL that distracted me" [puppet] - 10https://gerrit.wikimedia.org/r/427594 (owner: 10Giuseppe Lavagetto) [14:17:22] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142762 (10Marostegui) [14:17:50] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (10Marostegui) @BBlack in the case of db1114 atop was normally running without causing any issues, but every 10 minutes it would spike for like 2-3 seconds using lots of the cores to their 100% (T1... [14:18:30] (03PS4) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:18:42] (03CR) 10Dzahn: [C: 032] "re-revert of https://gerrit.wikimedia.org/r/#/c/427594/" [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:20:12] (03CR) 10Dzahn: "fixed in https://gerrit.wikimedia.org/r/#/c/427668/" [puppet] - 10https://gerrit.wikimedia.org/r/427594 (owner: 10Giuseppe Lavagetto) [14:22:51] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142770 (10faidon) >>! In T136732#4139610, @Ottomata wrote: > We could do that, but we wanted something centralized and reproducible (e.g. include a pu... [14:24:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) [14:24:47] (03PS1) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [14:25:33] (03PS4) 10Gehel: Make tilerator_storage_id to kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [14:25:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:26:20] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4142789 (10Dzahn) - https://releases.wikimedia.org/parsoid/ has been create...
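If the HDFS route floated in the T136732 thread above wins out, the "archive old MaxMind databases" job could reduce to something like the sketch below; every path and filename here is illustrative, not what was actually deployed:
```
DATE=$(date +%Y-%m-%d)
hdfs dfs -mkdir -p "/wmf/data/archive/geoip/${DATE}"
hdfs dfs -put /usr/share/GeoIP/GeoIP2-City.mmdb "/wmf/data/archive/geoip/${DATE}/"
```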
[14:27:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:28:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 for alter table (duration: 01m 13s) [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:29:03] (03CR) 10Muehlenhoff: [C: 032] Use a WMF-specific version number, not one from Debian backports [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/427670 (owner: 10Muehlenhoff) [14:29:03] !log Deploy schema change on db1082 (this will generate lag on s5 on labs hosts) - T191519 T188299 T190148 [14:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [14:29:11] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [14:29:11] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [14:29:51] (03CR) 10Gehel: [C: 032] "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10981/" [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [14:30:00] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142808 (10Ottomata) I don't have much context of how geowiki runs, but storing this in HDFS would be fine. We (I?) just thought it would be better to... [14:30:09] !log Start atop on db1114 without "-R" - T192551 [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] T192551: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 [14:34:00] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142814 (10Marostegui) So atop is now running on db1114 like: ``` root 30566 0.0 0.0 24712 7780 ? S […] (03CR) 10Andrew Bogott: [C: 031] base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:36:54] (03PS2) 10Gehel: wdqs: tune performance limits for the new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) [14:37:21] (03PS1) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:37:59] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:38:11] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142823 (10fdans) Got it, yeah uploading to HDFS seems pretty sensible. The only documented application for this archive is history reconstruction, so...
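The test !logged at 14:30 ("Start atop on db1114 without "-R"") compares the daemon's behaviour with and without that extra flag, against the jessie-style invocation BBlack pasted at 14:14. Roughly, with the log path and 600-second interval matching the Debian packaging defaults:
```
sudo systemctl stop atop
sudo /usr/bin/atop -a -w /var/log/atop/atop_$(date +%Y%m%d) 600 &   # same raw-log writer, minus -R
ps -ef | grep '[a]top'   # confirm which flags the running instance actually carries
```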
[14:42:07] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4142839 (10Vgutierrez) p:05Triage>03Normal [14:42:14] (03PS2) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:42:21] (03CR) 10Gehel: [C: 032] "puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler03/10984/" [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:42:47] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:42:51] (03PS3) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:43:20] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:44:52] (03PS4) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:45:34] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:46:00] <_joe_> mutante: I did revert your change from yesterday as it was making puppet fail on releases* [14:46:07] <_joe_> the one introducing releases-parsoid [14:46:12] (03PS5) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:46:43] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:47:16] !log Create bureaucrat account for [[User:Anderi Store]] on romd.wikimedia (T187184) [14:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:22] T187184: WMF-hosted wiki request for Ro-Md Wikimedians user group - https://phabricator.wikimedia.org/T187184 [14:48:38] !log Erratum: read "[[User:Andrei Stroe]]" and not "[[User:Anderi Store]]" for the previous entry (T187184) [14:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] 10Operations, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284850 (10elukey) [14:49:13] was there any problem with irc.wikimedia.org today? 
[14:49:21] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#4142865 (10elukey) ping - status :) [14:53:06] (03PS1) 10Ottomata: Add certificates for kafka_test_broker and kafka_main-deployment-prep_broker [labs/private] - 10https://gerrit.wikimedia.org/r/427676 (https://phabricator.wikimedia.org/T167039) [14:56:50] (03CR) 10Ottomata: [V: 032 C: 032] Add certificates for kafka_test_broker and kafka_main-deployment-prep_broker [labs/private] - 10https://gerrit.wikimedia.org/r/427676 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:05:47] (03PS1) 10Elukey: Reimage analytics1069 to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/427679 (https://phabricator.wikimedia.org/T192557) [15:07:40] (03CR) 10Elukey: [C: 032] Reimage analytics1069 to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/427679 (https://phabricator.wikimedia.org/T192557) (owner: 10Elukey) [15:08:48] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4142938 (10JBennett) What's the data? From our clicktracking efforts what will we be collecting? [15:09:53] (03PS1) 10Ottomata: Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) [15:09:59] (03PS1) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:10:22] (03CR) 10jerkins-bot: [V: 04-1] Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:10:26] (03CR) 10jerkins-bot: [V: 04-1] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:10:30] 10Operations, 10Traffic, 10Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142948 (10Vgutierrez) [15:10:58] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:11:38] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:11:39] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
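A "degraded" systemd state like the restbase1016 alerts above means one or more units sit in the failed state; the usual first-response sequence, with the unit name taken from the alert itself:
```
systemctl --failed                       # which unit(s) dragged the state down
sudo journalctl -u cassandra-a -n 100    # recent log lines for the failed instance
sudo systemctl restart cassandra-a       # once the cause is understood
```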
[15:11:41] !log sbisson@tin Started deploy [kartotherian/deploy@74121d5]: Deploy latest kartotherian with new i18n sources [15:11:45] (03PS2) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:25] (03CR) 10Ottomata: "No op in prod" [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:12:31] (03CR) 10Ottomata: [V: 032 C: 032] Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:13:13] 10Operations, 10Traffic, 10Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142965 (10Vgutierrez) p:05Triage>03Normal [15:15:39] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4142985 (10thcipriani) [15:16:59] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4142999 (10thcipriani) This would also give us a place to test various mwscripts used by scap with php7 [15:16:59] !log sbisson@tin Finished deploy [kartotherian/deploy@74121d5]: Deploy latest kartotherian with new i18n sources (duration: 05m 19s) [15:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4143013 (10thcipriani) [15:17:14] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4143012 (10thcipriani) [15:17:28] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused [15:20:28] _joe_: yes, i saw. thank you. i merged and got distracted. sorry about that. it's fixed now [15:21:06] (03PS1) 10Muehlenhoff: Add component/ci for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/427683 (https://phabricator.wikimedia.org/T191771) [15:21:10] scap question, where can I find a syntax reference for j2 templates? 
[15:22:54] (03PS1) 10ArielGlenn: keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) [15:23:00] (03CR) 10Muehlenhoff: [C: 032] Add component/ci for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/427683 (https://phabricator.wikimedia.org/T191771) (owner: 10Muehlenhoff) [15:23:13] (03CR) 10jerkins-bot: [V: 04-1] keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [15:24:37] (03CR) 10Filippo Giunchedi: [C: 031] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:25:56] nevermind, I just saw the erb syntax config [15:27:00] (03CR) 10Muehlenhoff: "Note that this won't upgrade existing stretch systems, so please upgrade these after rolling out the patch so that we have it in sync acro" [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:28:01] (03PS6) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [15:31:15] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active [15:31:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 [15:31:46] (03CR) 10Herron: [C: 032] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:31:55] (03CR) 10Herron: [C: 032] "will do!" [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:31:55] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [15:32:01] (03PS3) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:32:26] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4143046 (10akosiaris) >>! In T150532#4142647, @MoritzMuehlenhoff wrote: > Or we could upgrade the Ganeti cluster to stretch? It provides qemu 2.8 out of the box. I'd rather not couple the 2 upgrades. Both need to b...
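On the template question at 15:21 (answered at 15:25): scap's config templates use ERB, not Jinja2, so placeholders follow Ruby's templating syntax. A minimal sketch; the file name and variables are invented:
```
# ERB placeholders, as they would appear in a hypothetical templates/service-config.yaml.erb:
#   log_level: <%= log_level %>                 # <%= expr %> interpolates a value
#   <% if enable_debug %>debug: true<% end %>   # <% code %> evaluates without emitting output
```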
[15:33:05] (03PS1) 10Andrew Bogott: labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) [15:33:34] (03PS2) 10Andrew Bogott: labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) [15:33:43] !log sbisson@tin Started deploy [kartotherian/deploy@89c4ca9]: Deploy latest kartotherian with new i18n sources (take 2) [15:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:56] RECOVERY - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-a valid until 2018-08-17 16:11:26 +0000 (expires in 120 days) [15:36:35] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.004 second response time on 10.64.0.32 port 9042 [15:36:48] !log sbisson@tin Finished deploy [kartotherian/deploy@89c4ca9]: Deploy latest kartotherian with new i18n sources (take 2) (duration: 03m 05s) [15:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:14] !log sbisson@tin Started deploy [kartotherian/deploy@0a5a3ef]: Deploy latest kartotherian with new i18n sources (take 3) [15:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:37] (03CR) 10Andrew Bogott: [C: 032] labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [15:38:05] PROBLEM - cassandra-b service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:38:26] PROBLEM - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:38:45] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [15:38:45] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:39:14] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:40:30] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:40:44] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:42:37] !log sbisson@tin Finished deploy [kartotherian/deploy@0a5a3ef]: Deploy latest kartotherian with new i18n sources (take 3) (duration: 05m 22s) [15:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] (03PS1) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:43:31] (03PS1) 10Gehel: maps: disable OSM replication during tile regeneration [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) [15:44:02] (03CR) 10jerkins-bot: [V: 04-1] Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [15:44:06] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143070 (10mepps) @Dzahn I'm looking for access to Pivot (especially https://pivot.wikimedia.org/#banner_activity_minutely and https://pivot.wikimedia.org/#pageviews-hourly), SWAP (... [15:44:09] !log fdans@tin Started deploy [analytics/refinery@5d0f63f]: deploying to launch page preview job [15:44:13] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2074 (duration: 01m 17s) [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:06] (03PS2) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:45:23] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143094 (10mepps) We want access for all of fr-tech actually for these purposes: https://phabricator.wikimedia.org/T181629 [15:45:23] (03PS3) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:46:29] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 [15:47:45] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [15:47:59] !log Deploy schema change on dbstore1002 (s5) - T191519 T188299 T190148 [15:48:01] (03CR) 10Sbisson: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [15:48:05] RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:06] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [15:48:06] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [15:48:06] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [15:49:06] PROBLEM - etcd request latencies on neon is CRITICAL: 9.41e+04 ge 5e+04
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:13] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:49:26] PROBLEM - Request latencies on neon is CRITICAL: 1.234e+05 ge 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:46] PROBLEM - etcd request latencies on chlorine is CRITICAL: 1.223e+05 ge 5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:50:31] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:50:43] !log fdans@tin Finished deploy [analytics/refinery@5d0f63f]: deploying to launch page preview job (duration: 06m 34s) [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 after alter table (duration: 01m 16s) [15:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:30] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:52:47] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4143132 (10CCogdill_WMF) We're collecting click engagement off fundraising emails (actual fundraising appeals, or informational newsletter emails) that... [15:53:35] RECOVERY - Request latencies on neon is OK: (C)1e+05 ge (W)5e+04 ge 2.088e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:53:55] RECOVERY - etcd request latencies on chlorine is OK: (C)5e+04 ge (W)3e+04 ge 4320 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:54:15] RECOVERY - etcd request latencies on neon is OK: (C)5e+04 ge (W)3e+04 ge 3208 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:22] (03PS2) 10Gehel: maps: disable OSM replication during tile regeneration [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) [15:58:44] (03CR) 10Gehel: [C: 032] "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/10986/" [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [15:59:35] (03PS4) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [16:00:05] godog, moritzm, and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
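Every depool/repool cycle in this window (db2076, db2075, db2074, db1113:3315, db1082) has the same two-step shape: flip the replica's weight in mediawiki-config, then sync that single file from the deploy host. A sketch with invented hostnames and weights, not the real s5 topology:
```
# wmf-config/db-eqiad.php, fragment of the section loads:
#   's5' => [
#       'db1070' => 0,     # master (weight 0; invented name)
#       'db1082' => 100,   # set to 0 or comment out to depool
#   ],
# then, on the deploy host:
scap sync-file wmf-config/db-eqiad.php 'Depool db1082 for alter table'
```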
[16:04:20] (03PS2) 10ArielGlenn: keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) [16:05:42] RECOVERY - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-b valid until 2018-08-17 16:11:09 +0000 (expires in 120 days) [16:06:52] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.118 port 9042 [16:07:41] (03CR) 10Filippo Giunchedi: Target kafka jmx exporters by profiles instead of roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:08:16] (03PS5) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [16:12:41] (03CR) 10Ottomata: Target kafka jmx exporters by profiles instead of roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:12:43] (03PS2) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [16:12:52] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:13:12] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:13:22] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [16:13:23] PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:13:43] (03CR) 10Ottomata: "BTW, this will help avoid bugs like https://gerrit.wikimedia.org/r/#/c/427596/" [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:15:16] (03PS3) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [16:16:12] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [16:16:23] RECOVERY - cassandra-c service on restbase1016 is OK: OK - cassandra-c is active [16:16:51] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, IMO better to use find in this case" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [16:20:33] !log shutting down tilerator on maps[12].* for maintenance - T191655 [16:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:39] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [16:20:53] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2018-08-17 16:11:29 +0000 (expires in 119 days) [16:21:22] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.34 port 9042 [16:22:12] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:22:12] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [16:22:23] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:22:23] PROBLEM - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:24:12] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:00] !log restarting tilerator on maps[12].* - T191655 [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:06] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [16:28:17] (03CR) 10Muehlenhoff: [C: 031] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [16:28:55] (03CR) 10Filippo Giunchedi: "LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:29:37] (03CR) 10Filippo Giunchedi: "> LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:31:59] (03PS1) 10Elukey: Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557) [16:33:12] 41 hosts to go, will be looong... [16:34:14] *pfft*, we have about 250 mw* servers to reimage :-) [16:34:53] * elukey cries in a corner [16:34:59] automate all the things! [16:35:35] * elukey asks to Reedy some mercy [16:35:37] :D [16:36:16] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4143289 (10Gehel) >>! In T185504#4131278, @Dzahn wrote: > `ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", S... [16:37:32] moritzm: let me know if I can help with the mw reimages, I can schedule some time during the next weeks to work on it [16:43:14] 10Operations, 10Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4143318 (10herron) a:03herron That's odd, will spin up a test instance and attempt to reproduce [16:45:32] RECOVERY - Check systemd state on restbase1007 is OK: OK - running: The system is fully operational [16:46:13] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active [16:46:50] (03CR) 10Krinkle: [C: 031] "LGTM, and confirmed by running `php -S localhost:34343` and checking via http://localhost:34343/docroot/noc/db.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [16:49:22] elukey: it's fine, large scale reimages can start when we've fully sorted out the memcached situation, right now I'm mostly rolling out some systems to catch potential regressions [16:50:14] !log uploaded tidy-0.99 to component/ci for apt.wikimedia.org/stretch-wikimedia (T191771) [16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:20] T191771: [REL1_30] Some parserTests fail on debian stretch using Tidy, because of a new version of libtidy - https://phabricator.wikimedia.org/T191771 [16:50:41] moritzm: ack [16:52:13] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on 10.64.0.230 port 9042 [16:52:33] RECOVERY - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is OK: SSL OK - Certificate restbase1007-a valid until 2018-08-17 16:10:53 +0000 (expires in 119 days) [16:59:00] (03PS7) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 
(https://phabricator.wikimedia.org/T150672) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:55] (03PS8) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [17:01:14] (03CR) 10Dzahn: [C: 032] "+ auto_sync cron job" [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [17:02:33] (03CR) 10Cmjohnson: [C: 032] Adding dns for db1116-1123 [dns] - 10https://gerrit.wikimedia.org/r/427536 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [17:02:35] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3065109 (10Imarlier) @fgiunchedi Mentioned in #wikimedia-perf that he thought he remembered there being a reason why submitting metrics via graphite wouldn't work. First, her... [17:05:58] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4143465 (10Dzahn) In Hiera it is defined which is the currently "active" rel... [17:06:31] (03PS1) 10Herron: install_server: reinstall mx2001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) [17:07:33] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4143477 (10Dzahn) @ssastry I think this should have resolved the ticket. See... 
[17:07:38] (03CR) 10Herron: [C: 04-2] "not to be merged until mx2001 is depooled" [puppet] - 10https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [17:13:43] (03CR) 10Andrew Bogott: [C: 032] Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:15:07] (03Merged) 10jenkins-bot: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:18:05] !log andrew@tin Synchronized docroot/noc/db.php: Moving labtestwikitech to m5, step 1 (duration: 01m 16s) [17:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] (03CR) 10jenkins-bot: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:20:13] !log andrew@tin Synchronized wmf-config/db-codfw.php: Moving labtestwikitech to m5, step 2 (duration: 01m 16s) [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:44] !log andrew@tin Synchronized wmf-config/db-eqiad.php: Moving labtestwikitech to m5, step 3 (duration: 01m 16s) [17:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:50] (03CR) 10Elukey: [C: 032] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [17:33:31] (03PS1) 10Andrew Bogott: m5: allow labtestweb2001 mysql access [puppet] - 10https://gerrit.wikimedia.org/r/427720 (https://phabricator.wikimedia.org/T192339) [17:33:44] (03PS4) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [17:35:55] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4143586 (10herron) Ok, test mx instance is looking good. Will plan to depool and reinstall mx2001 with Stretch next week. @ayounsi could we coordinate a time to reject connections to mx200... [17:37:27] (03CR) 10Andrew Bogott: [C: 032] m5: allow labtestweb2001 mysql access [puppet] - 10https://gerrit.wikimedia.org/r/427720 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [17:42:52] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4143632 (10Andrew) [17:45:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 280.83 seconds [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:11] * thcipriani uses window to catch-up train [18:11:09] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143711 (10Nuria) @meeps: please note that the ticket you are linking to is also a request for access where I noted that these tools require different access levels, ALL of them giv... [18:16:10] AndyRussG: so I just looked at tin and saw that https://gerrit.wikimedia.org/r/#/c/427235/ was fetched for wmf.30 but not checked out [18:16:40] thcipriani: hi! 
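[editor's note] The labtestwikitech move synced above in three steps touches the database-section mapping in wmf-config. As a rough, hedged illustration only (this is not the content of gerrit:427690; the hosts and load weights below are invented), re-homing a wiki in an LBFactoryMulti-style configuration generally comes down to one 'sectionsByDB' entry:

<?php
// Hypothetical sketch of an LBFactoryMulti config; everything concrete
// here (host names, weights) is made up for illustration.
$wgLBFactoryConf = [
    'class' => 'LBFactoryMulti',

    // Wikis listed here are served by the named section; anything not
    // listed falls through to DEFAULT. Adding this one entry is what
    // "moves" labtestwikitech from its old section onto m5.
    'sectionsByDB' => [
        'labtestwikitech' => 'm5',
    ],

    // Per-section host => load maps; by convention the zero-weight
    // first entry is the master and the rest are replicas.
    'sectionLoads' => [
        'DEFAULT' => [ 'db1001' => 0, 'db1002' => 200 ],
        'm5'      => [ 'db1009' => 0, 'db1010' => 100 ],
    ],
];

Syncing noc's db.php first, then db-codfw.php, then db-eqiad.php (as done above) updates the public documentation and the passive datacenter before the active one.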
[18:16:42] one sec [18:16:46] ok [18:17:16] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4143719 (10ayounsi) [18:18:34] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4062522 (10ayounsi) Verified that external monitoring doesn't do ping checks (but http, etc. instead) to hostnames (en.wikipedia.org, etc). Added a Watchmouse ping check for... [18:18:38] thcipriani: the intention was that that patch, as well as https://gerrit.wikimedia.org/r/#/c/427439/, would go out during yesterday's morning SWAT [18:19:20] stuff happened, not related to any issues with those patches, and somehow stuff only made it to group0. I didn't follow what happened with the train [18:19:51] Both those can be pushed out everywhere with wmf.30, if it's not a bother [18:20:38] I can make them live now, but it doesn't look like they are currently live anywhere [18:20:46] do you want to test them? [18:20:58] thcipriani: they are, or were, on mediawiki.org yesterday [18:21:17] We did test the first (the one you mentioned) on prod, via mwdebug1002 [18:21:48] The second one can only be really tested on meta, which wasn't updated. But it's very, very minimal, so I'd recommend just going ahead and pushing it all out [18:22:11] hrm hold on one second, maybe the way this was fetched was just weird [18:22:41] Hmmm [18:22:45] Anyway, once the train has traversed its silicon track (and full scap is done) I can double-check that it's all good [18:25:25] AndyRussG: false alarm. So when I fetched down changes for mediawiki core for 1.31.0-wmf.30 the submodule bumps for CentralNotice came down, but I guess the changes for the actual extension were already fetched, just not the submodule bumps on core [18:25:58] tl;dr: git looked weird, but wmf.30 looks up-to-date on the appservers [18:32:19] testwiki is now broken? [18:32:25] https://test.wikipedia.org/wiki/File:Rxy.svg.png [18:32:46] [WtjgxgpAICsAAFIqsawAAADM] /wiki/File:Rxy.svg.png InvalidArgumentException from line 875 of /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/loadbalancer/LoadBalancer.php: No server with index 'Array'. [18:32:57] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4143748 (10mmodell) @fgiunchedi: I'm not sure, the commit says it's for py3 transition, but I'm not sure why it matters. @demon, can... [18:32:58] !log thcipriani@tin Synchronized php-1.31.0-wmf.30/resources/src/jquery: [[gerrit:427709|jquery.makeCollapsible: Only add "[" "]" to autogenerated toggles]] T192140 (duration: 01m 17s) [18:33:02] works for me [18:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:05] T192140: Square brackets shown around the expand/collapse icons on RC - https://phabricator.wikimedia.org/T192140 [18:33:17] hmm private mode is ok.. [18:35:50] https://phabricator.wikimedia.org/P7018 [18:36:44] probably the error occurs for the file uploader? [18:38:00] alt account is ok. [18:38:19] rxy still gets the error [18:38:46] rxy: could you file a task for that? I definitely see that error in the logs. [18:39:09] k... [18:45:37] https://phabricator.wikimedia.org/T192584 [18:45:44] thanks!
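[editor's note] The InvalidArgumentException reported above (filed as T192584) gets diagnosed a few lines below: Article.php handed a query-options array to a parameter that expects an integer server group. A self-contained sketch of that failure mode, assuming nothing beyond what is quoted in this log (MediaWiki's real constants live in Defines.php and the real validation in rdbms/loadbalancer/LoadBalancer.php):

<?php
// Stand-ins for MediaWiki's server-group constants.
const DB_REPLICA = -1;
const DB_MASTER = -2;

// Stand-in for the LoadBalancer lookup that threw on testwiki.
function getConnection( $index ) {
    if ( !is_int( $index ) ) {
        // An array rendered as a string becomes the literal "Array",
        // which is exactly the error text rxy pasted above.
        $shown = is_array( $index ) ? 'Array' : (string)$index;
        throw new InvalidArgumentException( "No server with index '$shown'." );
    }
    return "connection for server group $index";
}

// Buggy call shape: Article.php passed [ 'USE INDEX' => 'rc_timestamp' ]
// as the third argument of RecentChange::newFromConds(), which is
// $dbType = DB_REPLICA, so the options array ended up here:
try {
    getConnection( [ 'USE INDEX' => 'rc_timestamp' ] );
} catch ( InvalidArgumentException $e ) {
    echo $e->getMessage(), "\n"; // No server with index 'Array'.
}

// The fix (gerrit:427755, "Do not pass USE INDEX to a $dbType
// parameter") drops the stray argument so the default applies again:
echo getConnection( DB_REPLICA ), "\n";

Per the discussion below, the broken call had apparently sat in place for years and only started firing once the surrounding patrolling configuration changed (T184791).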
[18:45:49] thanks too [18:51:00] added as a blocker for 1.31.0-wmf.30 train rollout [18:51:25] thx :) [18:51:37] thcipriani: I've found it [18:51:38] public static function newFromConds( [18:51:38] $conds, [18:51:38] $fname = __METHOD__, [18:51:39] $dbType = DB_REPLICA [18:51:41] ) { [18:51:46] in Article.php [18:51:48] [ 'USE INDEX' => 'rc_timestamp' ] [18:52:30] https://github.com/wikimedia/mediawiki/blame/master/includes/page/Article.php#L1069 [18:52:37] I guess RecentChange has changed [18:53:35] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4143821 (10RStallman-legalteam) Tim's NDA is fully signed and on file with legal. Thanks! [18:57:37] It kinda looks like the code hasn't changed recently [18:59:25] hmm same error at mediawiki.org too [18:59:34] is this just a longstanding bug that's surfacing just now? [18:59:54] It could be, yup [19:00:04] thcipriani: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:26] so far my horoscope is correct, I guess. [19:00:45] sorry for finding this out... or... is it good timing? [19:02:18] https://gerrit.wikimedia.org/r/#/c/427751/ [19:02:24] That'll fix it... [19:02:50] It's almost like a parameter has been removed from RecentChange::newFromConds [19:03:27] thcipriani: Unless something has changed with patrolling config recently [19:03:38] That'd probably be the more likely answer [19:05:56] "something has changed with patrolling config recently"-> https://phabricator.wikimedia.org/T184791 > [19:05:59] ? [19:08:07] thcipriani: https://gerrit.wikimedia.org/r/427755 want to try that on .30? [19:09:36] Reedy: sure [19:09:58] Looks very likely that's a bug that's been sitting there for yeaaaars [19:11:02] interesting [19:15:41] (03PS2) 10Dzahn: Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:16:08] (03CR) 10Dzahn: [C: 032] Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:17:14] (03PS7) 10Dzahn: Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [19:18:56] (03CR) 10Dzahn: [C: 032] Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [19:19:00] thanks :) [19:19:16] Teamwork! [19:19:50] yep, paladox :) [19:19:56] :) [19:20:03] applied on gerrit2001 [19:20:20] no_justification wondering could you do that gerrit wmfcontent thing please :) [19:20:25] domain) [19:22:23] thcipriani: it merged [19:22:28] !log gerrit: restarting services to pick up gc & indexing changes [19:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:01] Reedy: just saw, but ^ gerrit is restarting so I can't fetch just now :) [19:23:06] lols [19:26:21] gerrit back [19:27:12] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:27:20] (03PS3) 10Dzahn: Gerrit: Move all logging to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [19:27:55] Reedy: ok, got it, pulled over to mwdebug1002 [19:28:00] anything to test there? [19:28:01] (03CR) 10Paladox: [C: 031] "I've been running this on gerrit-test3 for a long while 2+ weeks now. and has been working." [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [19:28:16] thcipriani: You can visit the file page that rxy linked and check it doesn't blow up [19:28:21] (it should work) [19:28:47] that will require another restart, as log4j 1.x does not reload like log4j2 does [19:32:03] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build] [19:32:03] Reedy: pretty sure that page always did work for me [19:32:09] orly? [19:32:11] Let me check [19:32:25] https://test.wikipedia.org/wiki/File:Rxy.svg.png broken normally... [19:32:34] good on mwdebug1002 [19:32:36] tested [19:32:37] SHIP IT [19:32:49] :) [19:32:50] k [19:33:16] Depending on how much longer .29 is hanging around can decide whether we merge to there too [19:34:51] hopefully wmf.29 will be gone today, but at the rate I'm going... [19:35:33] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143951 (10AndyRussG) >>! In T192472#4143711, @Nuria wrote: > @meeps: please note that the ticket you are linking to is also a request for access where I noted that these tools requ... [19:35:47] !log thcipriani@tin Synchronized php-1.31.0-wmf.30/includes/page/Article.php: [[gerrit:427755|Do not pass USE INDEX to a $dbType parameter]] T192584 (duration: 01m 17s) [19:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] T192584: Error occurs in file page for Own uploaded files@1.31.0-wmf.30 (e8360e8) - https://phabricator.wikimedia.org/T192584 [19:36:44] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4143977 (10Eevans) Update: The decommission of restbase1010-c was discontinued after other instances in the rack began to fail... [19:38:26] (03PS1) 10Herron: puppetdb: add service enable => true [puppet] - 10https://gerrit.wikimedia.org/r/427772 (https://phabricator.wikimedia.org/T192531) [19:39:10] (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 [19:39:22] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4143992 (10herron) Sure enough, it's reproducible. Looks like the `systemd::service` entry used previously automatically set `enable => true` when called with `ensure => present` whic... [19:40:57] (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:41:26] Reedy: ok for now . 
thanks :) [19:42:11] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:45:07] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.30 [19:45:19] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144006 (10Pnorman) The answer in general is it depends. Are you looking for monitoring to diagnose problems, or alarms for health? I would recommend monitoring for maximum tra... [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:19] (03PS6) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) [19:49:17] (03CR) 10Hashar: "That requires the cumin master to use Stretch. On Jessie there are a bunch of apt / dependencies issues." [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [19:49:24] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:53:08] !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.30 (duration: 01m 15s) [19:53:13] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144043 (10Nuria) Access approved on my end. [19:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:02] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:12] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:03:35] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144066 (10Pnorman) Adding to the above, I would say that most of the other monitoring that can be done can be broken down into performance related metrics, like transactions pe... [20:09:22] thcipriani: how is the train going? [20:09:55] hashar: group0 + group1 done. Will roll forward all wikis shortly. [20:10:10] thcipriani: I will bring back quibble after that I guess [20:10:40] hashar: ok, I'll ping you when I'm done. [20:13:27] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144083 (10mepps) Thanks @nuria and @AndyRussG! Is there a next step @Dzahn @Reedy? [20:16:05] (03PS1) 10Andrew Bogott: Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 (https://phabricator.wikimedia.org/T189542) [20:20:09] (03Abandoned) 10Gilles: Xenon: don’t generate SVGs for recently modified logs [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [20:21:12] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144102 (10Gilles) Hah, turns out I actually didn't ever truly run xenon-generate-svgs locally on that file, when I did it failed just like in produc...
[20:22:45] !log milimetric@tin Started deploy [analytics/refinery@c1c9885]: Correcting hql from last deployment [20:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] (03PS1) 10Thcipriani: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 [20:27:54] !log milimetric@tin Finished deploy [analytics/refinery@c1c9885]: Correcting hql from last deployment (duration: 05m 09s) [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] PROBLEM - etcd request latencies on chlorine is CRITICAL: 5.329e+04 ge 5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:29:53] (03CR) 10Thcipriani: [C: 032] All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:31:07] (03Merged) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:31:32] RECOVERY - etcd request latencies on chlorine is OK: (C)5e+04 ge (W)3e+04 ge 3796 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:38] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.30 [20:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] hashar: train is done [20:38:39] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144203 (10Gilles) OK, the error points to the last line, because that's where the file cursor is, but the offending line happened earlier. It's this... [20:40:46] (03CR) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:43:42] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:48:12] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter -- T189822, T192456 [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:19] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [20:48:20] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [20:48:24] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-a -- T189822, T192456 [20:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:12] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144267 (10Gilles) Finally found the offending line, which is just "1" and happened in the middle of the file. It's possible that it was written by t... [20:55:43] thcipriani: awesome. 
Bringing quibble back [21:00:19] !move issue move of enwiki_content shard 2 from overloaded elastic1027 to elastic1017 [21:00:23] !log issue move of enwiki_content shard 2 from overloaded elastic1027 to elastic1017 [21:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:00] (03PS1) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) [21:04:18] (03PS2) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) [21:07:52] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4144324 (10ayounsi) @herron Monday 10:30am PDT? (5:30pm UTC) How long will the block be installed for? [21:08:52] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:11:56] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-c -- T189822, T192456 [21:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:05] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [21:12:05] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:15:54] !log Start cleanup, restbase10{07,11,16}-a -- T189822 [21:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:52] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144383 (10Dzahn) @mepps Yea, the next step would be that we need a SSH key from you. Could you create one (https://wikitech.wikimedia.org/wiki/Production_shell_access#SSH_Key_Requi... [21:22:38] !log Start cleanup, restbase10{07,11,16}-b -- T189822 [21:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:45] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:25:54] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4144406 (10herron) >>! In T175361#4144324, @ayounsi wrote: > @herron Monday 10:30am PDT? (5:30pm UTC) > How long will the block be installed for? Sounds good! Barring any unexpected issues... [21:26:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:27:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:27:57] wikipedia is slow [21:28:31] https://en.wikipedia.org/wiki/IOS [21:28:33] is slow [21:29:30] did you try with https://en.wikipedia.org/wiki/Microsoft_Windows ?
:) [21:29:37] I use mac [21:29:45] yeh [21:29:48] i went to that page [21:29:56] and clicking edit is causing it to slowly load [21:30:04] I can confirm the loading issues here too [21:31:09] page loads fine here [21:31:30] Seems to work now. [21:32:09] yea, it seems to be one of those quirks that esams has had for quite a while now [21:33:46] ack, there was a short spike of more 5xx on the graph linked above [21:37:15] (03CR) 10Herron: [C: 032] puppetdb: add service enable => true [puppet] - 10https://gerrit.wikimedia.org/r/427772 (https://phabricator.wikimedia.org/T192531) (owner: 10Herron) [21:37:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:38:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:41:45] !log Start cleanup, restbase10{07,11,16}-c -- T189822 [21:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:52] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:45:07] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4144420 (10herron) 05Open>03Resolved Fixed! ``` Notice: /Stage[main]/Puppetdb::App/Service[puppetdb]/enable: enable changed 'false' to 'true' ``` ``` UNIT FILE... [21:45:46] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144422 (10Dzahn) Thanks @Gehel and @pnorman! I would say let's start with this one: >>! In T185504#4143289, @Gehel wrote: > It looks like the script is trying to connect over... [21:48:18] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144425 (10Krinkle) I don't know if the cited portion was changed unintentionally, but that sample does not show a `1` on its own line. It shows a `1... [21:50:26] (03CR) 10Krinkle: Filter out invalid records in xenon-log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [21:51:13] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144436 (10Gehel) >>! In T185504#4144422, @Dzahn wrote: > Looking at the netbox module it seems /etc/postgresql/9.6/main/pg_hba.conf isn't puppetized while it does contain custo... [21:52:02] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4144438 (10MarcoAurelio) The error mentioned is gone, thanks. However we still have issues: T1... [22:03:52] mutante: I can confirm the 5xx spike on load.php as well [22:03:55] Do we know what caused it?
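[editor's note] On the xenon thread interleaved above (T169249, gerrit:427816 "Filter out invalid records in xenon-log"): flamegraph.pl consumes collapsed-stack records, one "frame1;frame2;... <count>" per line, and a malformed record (Gilles reports a bare "1"; Krinkle reads the sample differently) aborts SVG generation. The real change is a non-PHP script in the puppet repo; purely as an illustration of the filtering idea:

<?php
// Illustrative only, not the actual xenon-log patch. Copies stdin to
// stdout, dropping anything that is not "<stack> <integer count>",
// e.g. a stray "1" on a line of its own.
$in = fopen( 'php://stdin', 'r' );
while ( ( $line = fgets( $in ) ) !== false ) {
    if ( preg_match( '/^.+ \d+$/', rtrim( $line, "\n" ) ) ) {
        echo $line;
    }
}
fclose( $in );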
[22:05:20] about 10K failed requests in total over that 10min to load.php [22:06:07] (9min*60sec*19 503s/sec) [22:06:18] https://grafana.wikimedia.org/dashboard/db/resourceloader?refresh=5m&orgId=1&from=1524171141986&to=1524174448108 [22:06:25] It's 0 before and after that spike [22:07:51] logstash mediawiki-errors went from 4K/min to 20K/min and has not come down since [22:07:57] twentyafterfour: I think that means train rollback, right? [22:08:25] https://usercontent.irccloud-cdn.com/file/GJ4Kc29S/Screen%20Shot%202018-04-19%20at%2023.08.16.png [22:09:10] the logstash shape does seem to roughly take off same time as "Group1 to 1.31.0-wmf.30" [22:10:29] It seems 95% of it is this one: domain=127.0.0.1 url=runJobs.php channel=CirrusSearch message="Search backend error" [22:11:18] ebernhardson: might be related to your es changes? [22:11:33] https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors last 4 hours [22:12:29] (03PS4) 10BBlack: ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [22:14:57] (03CR) 10BBlack: [C: 032] ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [22:16:19] (03PS1) 10BBlack: fb traffic experiment: reduce to 1h [puppet] - 10https://gerrit.wikimedia.org/r/427821 [22:16:36] (03CR) 10BBlack: [V: 032 C: 032] fb traffic experiment: reduce to 1h [puppet] - 10https://gerrit.wikimedia.org/r/427821 (owner: 10BBlack) [22:18:25] (03PS4) 10BBlack: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - 10https://gerrit.wikimedia.org/r/426858 (owner: 10Ema) [22:19:00] (03CR) 10BBlack: [C: 032] Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - 10https://gerrit.wikimedia.org/r/426858 (owner: 10Ema) [22:25:38] I'm on train this week, I can roll back, but I didn't see a spike in fatal monitor, didn't realize there was a spike in mediawiki-error rate [22:30:27] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 and group2 wikis back to 1.31.0-wmf.29 [22:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:43] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:31:02] ok, we're back on wmf.29 for group1 and group2 [22:31:42] PROBLEM - Varnish frontend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:31:42] PROBLEM - Varnish backend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:31:58] this seems more like a prometheus failure than a varnish failure, but looking [22:32:38] (perhaps those should be UNKNOWN rather than CRITICAL?) [22:34:32] yeah, bast3002 seems in some kind of trouble (which is where prometheus goes through) [22:34:41] load average: 16.68 [22:36:02] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:36:03] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:36:03] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:36:04] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:36:04] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:36:05] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:36:05] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:36:06] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:36:12] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:12] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:12] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:36:12] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:36:14] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:36:14] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:36:15] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - Varnish frontend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - Varnish backend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - PyBal BGP sessions are established on lvs3001 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:33] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:36:34] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:36:34] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:36:35] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:36:35] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:36:42] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:42] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:36:53] prometh+ 9406 1 99 Feb22 ? 
100-20:00:44 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/ops/metrics -web.listen-address 127.0.0.1:9900 -web.external-url http://prometheus/ops -storage.local.retention 2190h0m0s -config.file /srv/prometheus/ops/prometheus.yml -storage.local.chunk-encoding-version [22:36:59] 2 [22:37:13] ^ this bit of prometheus seems to be locking up on CPU% and causing huge iowait on disk [22:38:02] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:38:03] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:38:03] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:38:04] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:38:04] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:38:05] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:38:05] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:38:06] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:38:06] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:38:06] Krinkle: that failure
isn't related to the prior latency issue that i moved an index for, it's something else that also happened for 30 minutes overnite that i noticed in our dashboards. Not sure yet what it is. [22:38:07] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:38:12] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:38:13] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:38:24] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:38:24] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:38:25] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:38:25] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:38:32] RECOVERY - Varnish frontend child restarted on cp3047 is OK: 
(C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:38:32] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:12] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:43:13] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:43:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:43:14] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:43:14] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:43:15] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:15] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:16] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:43:21] ridiculous :P [22:43:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:43:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:43:22] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:43:24] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:43:24] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:43:25] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:43:43] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:43:43] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:43:44] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:43:44] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:45] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:45:04] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:45:04] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:45:05] RECOVERY - Varnish backend child 
restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:45:05] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:45:06] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:45:06] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:45:12] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:45:12] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:45:12] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:45:13] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish backend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:45:14] RECOVERY - Varnish frontend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:45:14] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:47:33] ebernhardson: Krinkle so it seems like whatever is causing https://phabricator.wikimedia.org/T192609 was in wmf.30: as soon as I rolled back, those warnings stopped. It's still happening on group0 wikis. Adding it as a train blocker, FYI.
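The PROBLEM/RECOVERY flood around this conversation all comes from Icinga checks that fetch a metric from the Prometheus HTTP API (prometheus.svc.esams.wmnet) with a 10-second read timeout and compare the value against the thresholds printed in each alert ("(C)3 gt (W)1 gt ..."). A minimal sketch of that pattern, assuming a requests-based client and a hypothetical PromQL query; this is not the actual WMF check script:

```
# Sketch of a Prometheus-backed Icinga check. The endpoint, the
# 10-second timeout, and the two error strings mirror the alert text
# in this log; the metric name below is a hypothetical stand-in.
import sys
import requests

PROMETHEUS = "http://prometheus.svc.esams.wmnet/ops/api/v1/query"

def check(query, warning=1.0, critical=3.0):
    try:
        resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
        result = resp.json()["data"]["result"]
    except ValueError as exc:
        # Connection succeeded but the body was not JSON (e.g. an empty
        # response from a daemon that is down or mid-restart).
        print("CRITICAL - %s error while decoding json: %s" % (PROMETHEUS, exc))
        return 2
    except requests.exceptions.RequestException as exc:
        # The failure mode flooding this channel: Prometheus itself is
        # unreachable, so every host the check covers goes CRITICAL at once.
        print("CRITICAL - %s timeout while fetching: %s" % (PROMETHEUS, exc))
        return 2
    value = float(result[0]["value"][1]) if result else 0.0
    if value > critical:
        print("CRITICAL - %s gt %s" % (value, critical))
        return 2
    if value > warning:
        print("WARNING - %s gt %s" % (value, warning))
        return 1
    print("OK - (C)%s gt (W)%s gt %s" % (critical, warning, value))
    return 0

if __name__ == "__main__":
    # Hypothetical query: Varnish child restarts over the last hour.
    sys.exit(check('increase(varnish_mgt_child_start[1h])'))
```

Note that one unhealthy Prometheus backend is enough to flip dozens of per-host checks at once, which is exactly the spam pattern in this window.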
[22:47:54] thx [22:48:19] thcipriani: btw, it seems fatal-monitor (on logstash) doesn't include type:mw/channel:error [22:48:29] That'll be needed as soon as those are no longer in type:hhvm [22:48:32] (separate from php7 migration) [22:48:52] mw itself has a setting to send to channel=error without also sending to php stderr (type:hhvm currently) [22:48:56] thcipriani: I have one suspicion, a patch that changed how we serialize jobs to work with the new job queue. looking into it [22:48:59] which is planned to be turned on, and is already on in beta [22:49:17] ebernhardson: thank you! [22:49:53] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144571 (10BBlack) [22:49:55] Krinkle: k, we'll need to update fatal monitor as well as the logstash_checker script scap uses to check canaries for error rate spikes. [22:49:59] Might want to restore https://gerrit.wikimedia.org/r/#/c/427759/ and merge/deploy if 1.29 is hanging around [22:50:02] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:50:03] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:50:04] .29 [22:50:12] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out.
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:50:12] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:50:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:50:14] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:50:14] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:50:15] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:50:15] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:50:16] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:50:16] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:50:21] uh, not that one [22:50:22] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:50:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:50:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:50:30] https://gerrit.wikimedia.org/r/#/c/427754/ [22:50:32] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:50:32] (03PS1) 10Thcipriani: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 [22:50:33] PROBLEM - Varnish frontend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:50:33] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:50:33] and there it goes again, trying a daemon restart [22:50:34] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish backend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish frontend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish frontend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:51:54] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:52:03] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:52:03] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child 
restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:52:13] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:52:13] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:52:14] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:52:14] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:52:15] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:52:15] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:16] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:16] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:52:17] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:52:22] RECOVERY - Varnish backend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:52:34] (03CR) 10Thcipriani: [C: 032] Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:52:42] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish backend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:52:43] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:52:43] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:52:45] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:52:45] RECOVERY - PyBal BGP sessions are established on lvs3001 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:45] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:53:47] (03Merged) 10jenkins-bot: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:55:03] (03PS1) 10Thcipriani: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 [22:55:42] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144593 (10BBlack) It did keep spamming by the time I got done writing the above. Attempting to stop it now, but the basic daemon "stop" operation via systemctl is taking quite a long time (over 3 minute... 
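On Krinkle's fatal-monitor point above: the dashboard (and the logstash_checker canary script) currently matches type:hhvm documents, and would additionally need to match type:mw AND channel:error once MediaWiki sends errors to that channel directly. A hedged sketch of that combined filter as an Elasticsearch bool query; the field names come from the conversation, and the real saved search may be structured differently:

```
# Hedged sketch of the combined fatal-monitor filter Krinkle describes.
# Field names (type, channel) are taken from the IRC conversation; the
# actual dashboard / logstash_checker query definitions may differ.
fatal_monitor_query = {
    "query": {
        "bool": {
            "should": [
                # current source of fatals/errors (HHVM stderr)
                {"term": {"type": "hhvm"}},
                # additionally needed once MW logs errors directly
                {"bool": {"must": [
                    {"term": {"type": "mw"}},
                    {"term": {"channel": "error"}},
                ]}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```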
[22:57:40] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144601 (10BBlack) And that was followed by this, by the time it finally stopped itself ~5 minutes later: ``` Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Don... [22:59:14] (03CR) 10jenkins-bot: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:59:31] (03CR) 10Thcipriani: [C: 032] Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [22:59:33] PROBLEM - Varnish backend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:59:43] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:59:52] PROBLEM - Varnish frontend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: 
Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [23:00:03] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [23:00:03] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
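The alert text has changed here from read timeouts to "error while decoding json: Expecting value: line 1 column 1 (char 0)": the checks now get a connection but an empty (or otherwise non-JSON) body back, consistent with the prometheus@ops daemon on bast3002 being down or mid-restart while the checks keep polling. That message is simply Python's json parser complaining about empty input:

```
import json

# Reproduce the exact error string from the alerts above by parsing an
# empty response body.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print("error while decoding json: %s" % exc)
    # -> error while decoding json: Expecting value: line 1 column 1 (char 0)
```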
[23:00:12] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [23:00:12] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [23:00:14] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [23:00:14] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [23:00:15] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:00:15] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: 
http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:16] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:00:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish backend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [23:00:24] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [23:00:24] PROBLEM - Varnish backend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:00:42] PROBLEM - Varnish frontend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - PyBal BGP sessions are established on lvs3001 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:43] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:45] (03Merged) 10jenkins-bot: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [23:00:52] PROBLEM - Varnish backend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [23:02:15] (03PS2) 10Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) [23:04:51] !log thcipriani@tin Synchronized php: complete group1 and group2 wikis back to 1.31.0-wmf.29 (duration: 01m 16s) [23:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:57] (03CR) 10jenkins-bot: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [23:08:46] 10Operations, 10monitoring: 
prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144640 (10BBlack) It seems to be having problems coming up cleanly too, so more spam. First chunk of startup logs: ``` Apr 19 22:58:02 bast3002 systemd[1]: Starting prometheus server (instance ops)...... [23:09:30] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144641 (10BBlack) [also, I've downtimed all the esams-specific prometheus-based alerts in icinga for 24h now (varnish child-counting checks and pybal bgp checks)] [23:09:41] hopefully no more spam. worst case eventually one more wave of RECOVERY [23:10:43] RECOVERY - PyBal BGP sessions are established on lvs3001 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:10:52] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:10:53] RECOVERY - Varnish frontend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:11:02] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:11:12] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [23:11:13] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:11:17] it's always the worst case :P [23:11:22] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:11:22] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:11:23] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:11:24] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish frontend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish frontend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish backend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:11:32] 
[23:11:32] RECOVERY - Varnish frontend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops
[23:11:32] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops
[23:11:33] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish backend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops
[23:11:45] bblack: well you jinxed it by mentioning it
[23:11:47] in any case, these are all downtimed for 24h now, they're just re-alerting recovery because the last time they failed was before the downtime was set
[23:11:52] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish backend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops
[23:11:53] RECOVERY - Varnish frontend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops
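Reading the recovery output: "(C)3 gt (W)1 gt 1" appears to mean the check goes CRITICAL when the measured value exceeds 3 and WARNING when it exceeds 1, and the current value is 1, hence OK; the PyBal checks reporting "OK: NaN" is consistent with the freshly restarted Prometheus having no samples yet for the query. A small sketch of that assumed semantics (not the plugin's actual code):

```python
# Illustrative reading of the alert output above (assumed semantics,
# not the actual plugin code): "(C)3 gt (W)1 gt 1" means the check is
# CRITICAL when value > 3, WARNING when value > 1, and the current
# value is 1, hence OK. A NaN value compares False against both
# thresholds, so a just-restarted datasource with no samples also
# falls through to OK.
def check(value, warn=1, crit=3):
    if value > crit:
        return "CRITICAL - (C)%s gt (W)%s gt %s" % (crit, warn, value)
    if value > warn:
        return "WARNING - (C)%s gt (W)%s gt %s" % (crit, warn, value)
    return "OK: (C)%s gt (W)%s gt %s" % (crit, warn, value)


print(check(1))             # OK: (C)3 gt (W)1 gt 1
print(check(4))             # CRITICAL - (C)3 gt (W)1 gt 4
print(check(float("nan")))  # NaN > x is False, so this also reports OK
```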
[23:11:53] RECOVERY - Varnish backend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops
[23:12:02] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops
[23:12:02] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops
[23:12:03] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops
[23:12:03] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops
[23:12:12] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops
[23:12:14] RECOVERY - Varnish backend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops
[23:12:14] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops
[23:12:22] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops
[23:12:22] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops
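Each of these alerts links to the same Grafana dashboard, parameterized per host through the var-server and var-datasource template variables in the URL. A sketch of composing such links with a hypothetical helper (not the actual Icinga/puppet configuration; minor URL-encoding details differ from the links shown above):

```python
from urllib.parse import urlencode

# Hypothetical helper, not the actual puppet/icinga config: compose a
# per-host dashboard link from one shared dashboard by filling in the
# Grafana template variables seen in the alert URLs above.
BASE = "https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats"


def notes_url(server, site="esams", panel=66):
    params = [
        ("panelId", panel),
        ("fullscreen", ""),  # rendered as a bare flag in the real links
        ("orgId", 1),
        ("var-server", server),
        ("var-datasource", "%s prometheus/ops" % site),
    ]
    return BASE + "?" + urlencode(params)


print(notes_url("cp3040"))
```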
[23:12:32] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish backend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops
[23:12:33] RECOVERY - Varnish backend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops
[23:12:33] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops
[23:13:09] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144648 (10BBlack) Crash recovery appears to have completed at about 23:10:33 and things came back online. We'll see if it remains stable. Leaving the downtimes in place to avoid more spamming of IRC.
[23:13:30] !log ebernhardson@tin Synchronized php-1.31.0-wmf.30/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 17s)
[23:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:36] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[23:16:36] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4144668 (10EddieGP) p:05Low>03High Indeed, the jobqueue on beta is still broken, although...
[23:16:56] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4144670 (10demon) That was part of that commit. I was kinda following the example set by the conftool package. If this is problemati...
[23:16:56] !log ebernhardson@tin Synchronized php-1.31.0-wmf.29/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 18s)
[23:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:31] we have over 300 wikis now
[23:27:40] lfn.wp was 300 in my count
[23:27:48] wikipedias i mean, sorry
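On the closing exchange about the 300th Wikipedia: counts like this are usually taken from the dblist files in operations/mediawiki-config, which list one wiki database per line. A minimal sketch, assuming a local checkout of that repository (lfnwiki being the database for lfn.wikipedia.org):

```python
#!/usr/bin/env python3
# Minimal sketch, assuming a local checkout of operations/mediawiki-config,
# where dblists/wikipedia.dblist lists one Wikipedia database per line.
with open("dblists/wikipedia.dblist") as f:
    wikis = [line.strip() for line in f
             if line.strip() and not line.startswith("#")]

print("%d wikipedias; lfnwiki present: %s" % (len(wikis), "lfnwiki" in wikis))
```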