[00:00:05] twentyafterfour: #bothumor I ❤ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:00:32] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 22.09, 22.96, 23.99 [00:04:00] Can someone disable https://phabricator.wikimedia.org/H285 please? It seems to be mistakenly configured to use an OR instead of an AND and thus adds #product-analytics to *any* task that has *any* activity. [00:04:02] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix path to hi.wikimedia.org 1x logo ([[Gerrit:427567]]) [00:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:32] Jayprakash12345: okay, I synced your namespace change and I'm done [00:05:00] Ah, I see greg just edited that. [00:07:03] (PS1) Dereckson: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) [00:07:09] Jayprakash12345: you've X-Wikimedia-Debug installed, haven't you? [00:07:27] yeah I have [00:07:34] (CR) Dereckson: [C: +2] Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:08:23] Just ping me to check the patch at mwdebug1002 [00:08:35] okay, waiting for zuul now [00:08:57] (Merged) jenkins-bot: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:09:11] Jayprakash12345: done [00:09:55] Looks good, the namespace is now in Hindi [00:10:04] Puppet, Beta-Cluster-Infrastructure: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4141484 (EddieGP) [00:10:09] Go ahead [00:11:04] (CR) jenkins-bot: Set project namespace for hi.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/427569 (https://phabricator.wikimedia.org/T188366) (owner: Dereckson) [00:12:11] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set project namespace for hi.wikimedia (T188366) (duration: 01m 16s) [00:12:17] !log Wikis creation done [00:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:18] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [00:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:00] jouncebot: now [00:13:00] For the next 0 hour(s) and 46 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T0000) [00:14:43] Dereckson: Thanks for being here [00:14:48] You're welcome.
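The patch-check workflow above (stage a wmf-config change, verify it on mwdebug1002, then sync everywhere) can be exercised from the command line. A minimal sketch, assuming the X-Wikimedia-Debug header syntax documented on wikitech and using the siteinfo API to inspect the project namespace that was just configured; the jq filter is illustrative:

```bash
# Route the request through the debug backend so the staged change on
# mwdebug1002 is exercised instead of the regular app server pool.
# Assumption: 'backend=...' is the header format; check the wikitech docs.
curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
  'https://hi.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json' \
  | jq -r '.query.namespaces."4"."*"'   # namespace id 4 = NS_PROJECT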
[00:15:32] (CR) Krinkle: [C: +1] Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: Jforrester) [00:17:49] Puppet, Beta-Cluster-Infrastructure: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141502 (EddieGP) [00:21:21] (PS4) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:21:54] (CR) jerkins-bot: [V: -1] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [00:23:02] (CR) EddieGP: "https://phabricator.wikimedia.org/T192473#4141453" [mediawiki-config] - https://gerrit.wikimedia.org/r/427281 (owner: Aaron Schulz) [00:23:50] (PS5) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:24:29] (CR) jerkins-bot: [V: -1] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [00:38:54] (PS6) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:54:12] (PS7) Dzahn: releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [00:57:34] (CR) Dzahn: [C: +2] releases: add directory for parsoid archive [puppet] - https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: Dzahn) [01:05:13] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/org/wikimedia/releases/parsoid] [01:06:12] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/org/wikimedia/releases/parsoid] [01:25:42] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [01:28:42] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:47:37] Operations, Mail, Surveys: Qualtrics cannot send email to wikimedia.org addresses - https://phabricator.wikimedia.org/T176666#4141649 (Neil_P._Quinn_WMF) [01:57:54] (Abandoned) Krinkle: Just run updateArticleCount.php over all.dblist [puppet] - https://gerrit.wikimedia.org/r/363639 (owner: Reedy) [02:36:44] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 05m 52s) [02:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:46] Dereckson: hi!
no, thankfully not an emergency
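The "Check systemd state" flaps on deploy1001 above track systemd's aggregate unit state. A sketch of what such a probe reduces to, assuming the Icinga plugin simply wraps standard systemctl calls (the real plugin may differ):

```bash
#!/bin/bash
# "running" means all units are healthy; "degraded" means at least one failed.
state=$(systemctl is-system-running)
if [ "$state" != "running" ]; then
    echo "CRITICAL - ${state}: The system is operational but one or more units failed."
    systemctl --failed --no-legend   # list the failed units for the operator
    exit 2
fi
echo "OK - running: The system is fully operational"
```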
[03:18:44] !log decommissioning Cassandra, restbase1010-c -- T189822 [03:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:50] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [04:42:21] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141728 (aaron) The warnings are pointless, the patch above adds an isset() check. [05:21:16] (PS1) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) [05:23:43] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:24:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:26:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 for alter table (duration: 01m 33s) [05:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:23] !log Deploy schema change on db1097:3315 - T191519 T188299 T190148 [05:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:30] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:27:30] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:27:31] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:29:11] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427593 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:31:28] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141772 (Marostegui) So this is almost confirmed to be related to atop. I killed it yesterday at around 14:30 and it remained stopped till 00:00 (when it started automatic... [05:33:24] !log Revert RX buffer changes on db1114 - T191996 [05:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:30] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [05:35:33] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141775 (Marostegui) RX buffers reverted ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current...
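The RX buffer revert on db1114 uses ethtool's ring-parameter interface; the truncated paste above shows the pre-set maximums. A sketch, assuming eno1 as in the log; the value being restored is illustrative, since the "Current" section of the paste is cut off:

```bash
ethtool -g eno1        # show pre-set maximums and current ring sizes
# Restore the RX ring to its previous size (illustrative value; the actual
# number is elided above). On many NICs, -G briefly resets the link.
ethtool -G eno1 rx 255
```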
[05:36:08] !log Kill atop on db1114 - T191996 [05:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:30] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#4141809 (Joe) [05:56:34] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141813 (Joe) [05:58:05] Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#4141816 (Joe) [05:58:11] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149754 (Joe) [05:59:58] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141821 (Joe) We've had 3 mdadm checkarray full runs since we merged the change in February, and no alert went off in the meantime. I would be inclined to consider this... [06:00:15] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#4141822 (Joe) Open→Resolved [06:05:36] Puppet, Analytics-Kanban, Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (Joe) I would strongly suggest that any system that wants to archive geoip data from maxmind should create its own repository of data and NOT... [06:08:01] Puppet, Analytics-Kanban, Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4141827 (Joe) >>! In T136732#4139610, @Ottomata wrote: > We could do that, but we wanted something centralized and reproducible (e.g. include a puppe... [06:12:08] (PS1) Giuseppe Lavagetto: Revert "releases: add directory for parsoid archive" [puppet] - https://gerrit.wikimedia.org/r/427594 [06:12:25] (CR) Giuseppe Lavagetto: [C: +2] "releases1001 puppet-agent[10365]: Could not find user releasers-parsoid" [puppet] - https://gerrit.wikimedia.org/r/427594 (owner: Giuseppe Lavagetto) [06:16:12] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:21:41] (PS1) Giuseppe Lavagetto: contint::packages::php: fix lua package name [puppet] - https://gerrit.wikimedia.org/r/427595 [06:23:05] (CR) Giuseppe Lavagetto: [C: +2] contint::packages::php: fix lua package name [puppet] - https://gerrit.wikimedia.org/r/427595 (owner: Giuseppe Lavagetto) [06:28:43] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:31:52] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/sbin/enforce-users-groups] [06:34:36] (CR) Gilles: [C: +1] Simplify threedtopng::deploy after image scaler removal [puppet] - https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff) [06:35:13] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:37:57] (PS1) Elukey: role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 [06:39:15] (PS2) Elukey: role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 [06:39:50] (CR) Elukey: [C: +2] role::prometheus::ops: fix new kafka analytics role class name [puppet] - https://gerrit.wikimedia.org/r/427596 (owner: Elukey) [06:49:32] PROBLEM - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.33 and port 9042: Connection refused [06:49:42] PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [06:49:52] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:50:10] hello restbase1016 [06:50:12] PROBLEM - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [06:54:26] can't find much in the cassandra logs [06:56:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] I'd ping the services team before attempting any restarts, so they can take a look at what happened [06:57:15] it shouldn't be a problem if one instance is down [06:57:19] mobrovac: --^ [06:57:44] or ^ godog [06:57:59] yep yep [06:58:43] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:01:43] RECOVERY - cassandra-b service on restbase1016 is OK: OK - cassandra-b is active [07:01:53] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [07:01:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 [07:03:23] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:04:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:06:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 after alter table (duration: 01m 17s) [07:06:13] RECOVERY - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-b valid until 2018-08-17 16:11:27 +0000 (expires in 120 days) [07:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:32] RECOVERY - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is OK: TCP OK - 0.001 second response time on 10.64.0.33 port 9042 [07:06:58] puppet brought it up again :) [07:09:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/427597 (owner: Marostegui) [07:16:51] Operations: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4141839 (akosiaris) stalled→Open >1 month with no incident. I'll proceed with rebooting all ganeti VMs on row_C and then move on to codfw
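The ganeti reboots announced here roll out the cache=none KVM disk setting from T181121. A sketch of how that looks with the standard Ganeti CLI, assuming the setting is applied as a cluster-level hypervisor default; the instance name is hypothetical:

```bash
# Make cache=none the cluster-wide KVM default, bypassing the host page
# cache implicated in the kernel I/O errors on ganeti1005-ganeti1008.
gnt-cluster modify -H kvm:disk_cache=none
# Each VM needs a reboot to pick up the new hypervisor parameter.
gnt-instance reboot examplevm.eqiad.wmnet   # hypothetical instance name
```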
[07:16:54] (PS1) Vgutierrez: install_server: Reimage lvs4006 as stretch [puppet] - https://gerrit.wikimedia.org/r/427598 (https://phabricator.wikimedia.org/T191897) [07:24:16] !log reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121 [07:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:22] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [07:26:59] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs4006 as stretch [puppet] - https://gerrit.wikimedia.org/r/427598 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [07:27:52] (CR) Filippo Giunchedi: [C: +1] Simplify threedtopng::deploy after image scaler removal [puppet] - https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff) [07:28:06] elukey: ack, yeah I think we saw that before :( [07:30:28] :( [07:32:23] !log Depool and reimage lvs4006 - T191897 [07:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:29] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [07:34:30] (PS1) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:37:34] (PS1) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [07:43:33] (CR) Filippo Giunchedi: [C: +2] "> Patch Set 7: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/427378 (owner: Filippo Giunchedi) [07:43:38] (PS8) Filippo Giunchedi: tox: run nagios_common tests [puppet] - https://gerrit.wikimedia.org/r/427378 [07:47:30] !log set cache=none for ganeti VMs in codfw cluster configuration.
VM reboots to follow T181121 [07:47:32] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:37] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [07:47:42] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:47:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:02] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:02] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary 
definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 404 (expecting: 200) [07:48:18] 404? [07:48:20] (PS2) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:48:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [07:48:40] wut? [07:48:42] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [07:48:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [07:48:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [07:48:46] a recovery already? what on earth happened? [07:48:52] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [07:48:52] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [07:48:53] (CR) jerkins-bot: [V: -1] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn) [07:49:02] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [07:49:02] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [07:49:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:49:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [07:49:15] akosiaris: maybe somebody changed the cat page? [07:49:47] https://en.wiktionary.org/w/index.php?title=cat&action=history [07:49:47] article hasn't changed since 4th of Feb so unrelated to that [07:49:50] <_joe_> that looks like a real issue on mobileapps [07:49:59] <_joe_> no way it's a problem with wikipedia [07:50:15] wait, Cat vs cat [07:50:17] <_joe_> or wiktionary [07:50:47] I am not so sure about that [07:51:04] the last change is a couple of mins ago [07:51:18] yes, and someone decided to inline the image base base64 [07:51:25] <_joe_> ok yeah I think it's vandalism [07:51:28] image as base64* [07:51:42] <_joe_> cat, not CAT [07:51:58] yes I looked for "cat" [07:52:56] (PS3) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [07:53:22] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4141874 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4006.ulsfo.wmnet ``` The log can be found in `/var/lo...
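The flapping check above exercises the mobileapps definition endpoint for en-wiktionary's "cat" entry; when a vandalised revision wipes the definitions, the service legitimately answers 404, which is why the alert cleared as soon as the bot reverted. A sketch of reproducing the probe by hand, assuming the public REST path mirrors the internal route the checker hits:

```bash
# 200 while definitions exist in the latest revision, 404 once they are gone.
curl -s -o /dev/null -w '%{http_code}\n' \
  'https://en.wiktionary.org/api/rest_v1/page/definition/cat'
```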
[07:53:27] (CR) jerkins-bot: [V: -1] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn) [07:54:21] so it's a small edit war right now [07:54:34] <_joe_> akosiaris: you mean a vandal [07:54:52] yeah, vandals wage edit wars by definition [07:54:57] (CR) Elukey: [C: +1] Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff) [07:54:58] but yes it's vandalism [07:56:02] and the bot just reverting the damage [07:56:07] but the 404 was a tad weird [07:58:27] <_joe_> the 404 is mobileapps telling the client that there was no summary to be fetched for the page [07:58:31] <_joe_> so kinda expected? [07:58:49] <_joe_> if the vandalism wiped out the definition [08:05:10] Operations, ops-eqiad, DC-Ops, hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4141920 (ema) [08:05:14] Operations, ops-eqiad, DC-Ops, hardware-requests, Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4141918 (ema) Resolved→Open Re-opening, this morning we had two icinga criticals for lawrencium and lawrencium.mgmt being down. Some de... [08:08:13] Operations, ops-eqiad, DC-Ops, hardware-requests, Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4102460 (MoritzMuehlenhoff) There are still DNS entries in git: jmm@korn:~/git/dns$ rgrep lawrenc * templates/10.in-addr.arpa:94 1H IN P... [08:11:47] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141946 (EddieGP) [08:14:07] !log upgrading app server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854) [08:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:14] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [08:14:22] !log reboot deploy1001 and arm keyholder T175288 [08:14:23] _joe_: yeah that's exactly what happened [08:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:28] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [08:14:30] vandal removed the definitions [08:14:39] (CR) DCausse: Add cirrussearch settings for wikibase (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: DCausse) [08:14:41] (PS15) DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [08:14:51] RECOVERY - nutcracker port on deploy1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:15:01] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [08:15:41] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker [08:15:53] (PS1) Gilles: Upgrade to 2.0 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/427612 (https://phabricator.wikimedia.org/T27611) [08:20:18] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as
stretch - https://phabricator.wikimedia.org/T191897#4141972 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4006.ulsfo.wmnet'] ``` and were **ALL** successful. [08:20:22] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141975 (EddieGP) [08:21:52] (PS1) Filippo Giunchedi: rake: ignore rubocop Style/NumericPredicate in taskgen [puppet] - https://gerrit.wikimedia.org/r/427614 [08:23:13] (CR) Filippo Giunchedi: [C: +2] rake: ignore rubocop Style/NumericPredicate in taskgen [puppet] - https://gerrit.wikimedia.org/r/427614 (owner: Filippo Giunchedi) [08:23:44] apergos: ^ try rebasing, should be working now [08:23:52] great thank you [08:24:18] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4141990 (EddieGP) p: Unbreak!→Low Per aaron's comment, just logspam. Seems the actual problem for renames was nutcracker, which I fixed in T192473#4141... [08:24:47] (PS4) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:26:44] yep we're back in business [08:29:50] (PS1) Vgutierrez: pybal: Re-enable bgp in lvs4006 [puppet] - https://gerrit.wikimedia.org/r/427615 (https://phabricator.wikimedia.org/T191897) [08:32:47] (PS5) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:35:40] (CR) Vgutierrez: [C: +2] pybal: Re-enable bgp in lvs4006 [puppet] - https://gerrit.wikimedia.org/r/427615 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [08:40:07] !log Repool (Re-enable BGP) lvs4006 - T191897 [08:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:12] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [08:43:58] (PS6) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:45:16] (PS3) Ema: role::kafka::analytics: get rid of ipsec [puppet] - https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) [08:46:07] (PS2) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [08:46:44] (CR) Ema: [C: +2] role::kafka::analytics: get rid of ipsec [puppet] - https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) (owner: Ema) [08:46:57] (PS3) Muehlenhoff: Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) [08:47:39] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142017 (Vgutierrez) [08:48:34] (CR) Muehlenhoff: [C: +2] Switch all mw hosts to stretch [puppet] - https://gerrit.wikimedia.org/r/427608 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff) [08:49:20] (PS2) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) [08:49:33] (PS2) Muehlenhoff: Also handle Prometheus exporters
in app server decom script [puppet] - https://gerrit.wikimedia.org/r/427340 [08:49:59] (CR) jerkins-bot: [V: -1] alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) (owner: Filippo Giunchedi) [08:50:01] (CR) Muehlenhoff: [C: +2] Also handle Prometheus exporters in app server decom script [puppet] - https://gerrit.wikimedia.org/r/427340 (owner: Muehlenhoff) [08:50:03] (PS7) ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [08:50:13] some days I am not meant to write even the simplest amount of puppet code. clearly today is one of those days [08:52:33] (PS1) Filippo Giunchedi: rubocop: display cop names [puppet] - https://gerrit.wikimedia.org/r/427619 [08:53:36] apergos: well.. on Tuesday I hit almost every use case of the jenkins commit validator.. I got to PS4 to get the commit message right /o\ [08:54:22] (PS3) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) [08:54:39] I think I've only pissed off the commit message checker once [08:54:49] now I've jinxed it of course [09:00:25] (PS1) Vgutierrez: install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) [09:01:19] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [09:01:50] (PS2) Vgutierrez: install_server: Reimage lvs4005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427621 (https://phabricator.wikimedia.org/T191897) [09:02:52] PROBLEM - DPKG on mw1276 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:02:52] PROBLEM - DPKG on mw1279 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:03:20] !log upgrading API server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854) [09:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:28] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [09:03:31] ^ that's me, forgot to silence in Icinga [09:03:34] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4142093 (MarcoAurelio) 👍 [09:03:42] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:52] RECOVERY - DPKG on mw1276 is OK: All packages OK [09:03:52] RECOVERY - DPKG on mw1279 is OK: All packages OK [09:04:32] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 77206 bytes in 0.155 second response time [09:06:32] !log Depool and reimage lvs4005 as stretch - T191897 [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:38] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [09:10:59] Operations, monitoring, Patch-For-Review: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857#4142097 (Volans) Open→Resolved To summarize the work done recently, I've made an audit of existing checks and fixed/improv...
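The MEMC_VAL_COMPRESSION_ZLIB rollout above runs canaries first, then batches of app and API servers; the brief DPKG and HHVM rendering flaps are the restart window of exactly this kind of step. A per-host sketch, assuming the depool/pool conftool wrapper scripts present on WMF app servers and hhvm as the package name; the host list is illustrative:

```bash
# One host out of rotation at a time: drain, upgrade, restart, re-add.
for host in mw1221 mw1222; do   # illustrative host list
  ssh "${host}.eqiad.wmnet" '
    sudo depool &&                      # remove the host from the LVS pools
    sudo apt-get install -y hhvm &&     # pick up the rebuilt package
    sudo systemctl restart hhvm &&
    sudo pool                           # put the host back in rotation
  '
done
```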
[09:14:20] (PS2) Filippo Giunchedi: base: alert on edac uncorrectable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [09:14:28] !log installing Java security updates on maps* plus rolling restart of Cassandra to pick up new JRE [09:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:50] !log reboot puppetdb1001 for the cache=none setting to apply. T181121 [09:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:56] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [09:26:31] Operations, DC-Ops, Traffic, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4142112 (fgiunchedi) I've taken a first stab at reporting uncorrectable errors in https://gerrit.wikimedia.org/r/c/422110/ as reported by the kernel, so at least... [09:27:47] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:47] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:52] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:52] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:53] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:02] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:12] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on ping1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:43] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:52] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:53] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:28:53] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:54] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:54] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:55] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:55] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:56] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:56] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:57] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:12] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:13] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:33] !log stop ircecho for a while, puppetdb1001 reboot was eventful [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:01] !log start a force puppet run in all of eqiad with a batch size of 30 [09:31:04] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142131 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2002.codfw.wmnet']... [09:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:53] (PS2) Elukey: role::configcluster: upgrade zookeeper main-eqiad to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) [09:32:09] mobrovac: all right I am ready [09:32:16] let's chat in here [09:32:49] ok let's go elukey [09:33:32] !log upgrade zookeeper on conf100[123] from 3.4.5 to 3.4.9 - T182924 [09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:38] T182924: Refresh zookeeper nodes in eqiad - https://phabricator.wikimedia.org/T182924 [09:33:48] elukey: FYI ircecho is still stopped [09:34:33] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142137 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4005.ulsfo.wmnet ``` The log can be found in `/var/lo...
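The "force puppet run in all of eqiad with a batch size of 30" after the puppetdb1001 hiccup is the standard cumin recovery pattern for a catalog-failure storm. A sketch, assuming cumin's direct host-glob selection and the run-puppet-agent wrapper; the exact query used is not in the log:

```bash
# -b 30 limits concurrency to 30 hosts at a time, so the freshly rebooted
# puppetdb is not stampeded by every agent in the site at once.
sudo cumin -b 30 '*.eqiad.wmnet' 'run-puppet-agent'
```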
[09:35:35] volans: thanks for the reminder, will watch icinga [09:36:01] mobrovac: doing sanity checks [09:36:53] ok so added downtime, puppet disabled on conf100[123], verified that cluster is working [09:37:00] 1001/2 are followers, 1003 leader [09:37:22] (CR) Elukey: [C: +2] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) (owner: Elukey) [09:37:42] Operations, Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4142144 (fgiunchedi) >>! In T175361#4140768, @herron wrote: > Indeed, after upgrading to `3.0.0~rc5-1~bpo9+1` mtail starts up happily. > > @fgiunchedi do you think it would be safe to pin the mtail package to... [09:38:22] proceeding with conf1001 [09:39:27] k [09:41:21] 1001 upgraded, all good so far [09:42:05] elukey: let's wait 3,4 minutes before proceeding [09:42:18] yep yep [09:42:27] Operations, Dumps-Generation, Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4142148 (ArielGlenn) I'm planning to try disabling the nfs attribute cache for files and directories on one of the snapshots for t... [09:42:28] i want to make sure all is still good on our side [09:43:18] (CR) Gehel: [C: +1] [cirrus] Increase the number of shards for wikidatawiki_content, enwiki_general [mediawiki-config] - https://gerrit.wikimedia.org/r/427176 (https://phabricator.wikimedia.org/T192064) (owner: DCausse) [09:44:01] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4142149 (MarcoAurelio) jobqueue at beta is down again; see ok elukey, looking good, let's proceed [09:47:10] mobrovac: ack, kafka/burrow metrics look good [09:49:19] upgrading 1002 [09:49:22] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142156 (akosiaris) [09:49:38] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142168 (akosiaris) p: Triage→High [09:49:57] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142156 (akosiaris) [09:50:28] done [09:51:22] !log rolling restart of Cassandra on maps completed [09:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] ^ gehel [09:51:49] Operations, Puppet, puppet-compiler, Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4142170 (EddieGP) [09:52:12] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:52:41] RECOVERY - puppet last run on bast1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:52:42] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:53:51] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:01] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:54:11] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:54:42] RECOVERY
- puppet last run on ms1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:46] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142197 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2002.codfw.wmnet'] ``` and were **ALL** successful. [09:54:56] !log Updating puppet compiler facts [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:02] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:55:08] !log reboot ganeti VMs on row_B in codfw for cache=none setting. T181121 [09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [09:55:46] elukey: ok let's proceed with the leader now? [09:55:51] mobrovac: ack [09:55:59] didn't see any glitch in my metrics [09:56:02] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:56:11] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:56:32] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:56:32] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:56:41] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:56:41] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:56:51] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:57:02] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:57:02] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:53] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:54] RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:58:48] mobrovac: done, new leader is 1002 [09:58:53] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:59:04] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:59:30] !log complete migration of zookeeper on conf100[123] [09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:53] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:35] elukey: all seems good from my pov [10:02:11] mobrovac: from mine too, thanks for the support! [10:02:40] thnx for pushing this through elukey! [10:03:28] mobrovac: hope that you will not hate me by the end of next week when we'll swap conf100[123] with conf100[456] :D [10:04:26] elukey: maybe i should go on vacations then?
:D [10:04:39] ahahha [10:04:47] (PS3) Filippo Giunchedi: base: alert on edac (un)correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [10:07:21] Operations, Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4142238 (akosiaris) [10:15:37] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142256 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4005.ulsfo.wmnet'] ``` and were **ALL** successful. [10:23:04] (PS5) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [10:23:24] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs4005 [puppet] - https://gerrit.wikimedia.org/r/427629 (https://phabricator.wikimedia.org/T191897) [10:23:59] (CR) Vgutierrez: [C: +2] pybal: Re-enable BGP in lvs4005 [puppet] - https://gerrit.wikimedia.org/r/427629 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:27:32] !log Repool (Re-enable BGP) lvs4005 - T191897 [10:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:38] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [10:30:57] (PS1) Sbisson: Make tilerator_storage_id to kartotherian [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [10:34:20] !log upgrading API servers mw1221-mw1235 to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [10:34:24] (CR) Gehel: [C: -1] Make tilerator_storage_id to kartotherian (1 comment) [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: Sbisson) [10:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:49] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142329 (Vgutierrez) [10:39:47] (CR) Sbisson: [C: +1] "It's time to remove those."
[puppet] - https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) (owner: Gehel) [10:41:23] (PS2) Sbisson: Make tilerator_storage_id to kartotherian [puppet] - https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [10:41:31] (PS1) Vgutierrez: hieradata: clean-up ulsfo lvs configuration [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) [10:45:14] (CR) Vgutierrez: "pcc looks happy and shows the expected noop: https://puppet-compiler.wmflabs.org/compiler02/10978/" [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:48:51] (PS1) Alexandros Kosiaris: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 [10:49:10] (CR) Vgutierrez: [C: +2] hieradata: clean-up ulsfo lvs configuration [puppet] - https://gerrit.wikimedia.org/r/427632 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [10:50:57] jouncebot: next [10:50:58] In 2 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1300) [10:54:02] (PS1) Jcrespo: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 [10:56:34] (PS1) Marostegui: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) [10:57:18] jynus: I will go after you [10:57:28] ok, then merging now [10:57:39] (CR) Jcrespo: [C: +2] mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:58:52] (Merged) jenkins-bot: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:59:10] (CR) jenkins-bot: mariadb: Depool db2076 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/427635 (owner: Jcrespo) [10:59:15] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:00:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:01:33] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2076 (duration: 01m 18s) [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113:3315 for alter table (duration: 01m 16s) [11:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:27] !log starting reimage of db2076 [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] (CR) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427636 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:05:18] !log Deploy schema change on db1113:3315 - T191519 T188299 T190148 [11:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:26] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [11:05:26] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [11:05:26]
T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [11:07:35] (CR) Alexandros Kosiaris: [C: +2] Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 (owner: Alexandros Kosiaris) [11:09:46] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 17s) [11:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:01] (CR) jenkins-bot: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/427634 (owner: Alexandros Kosiaris) [11:11:05] (PS1) Alexandros Kosiaris: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 [11:11:57] (CR) Alexandros Kosiaris: [C: +2] Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:12:44] (Merged) jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:14:29] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: T181121 (duration: 01m 16s) [11:14:30] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142413 (Marostegui) No more errors for the last 6 hours after killing atop. Also no drops or connection errors running the original RX buffers after reverting them as c... [11:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [11:16:03] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:04] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:23] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:24] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:24] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:31] (CR) jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/427639 (owner: Alexandros Kosiaris) [11:16:34] PROBLEM - puppet last run on mc2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:43] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:43] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:44] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:53] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:54] PROBLEM - puppet last run on ms-be2034 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:17:04] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:34] PROBLEM - puppet last run on mw2272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:43] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:44] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:13] PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:18:33] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142435 (10Gilles) How frequent were they as of late, before the change? [11:19:03] PROBLEM - puppet last run on mw2268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:03] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:03] PROBLEM - puppet last run on ganeti2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:09] hmm [11:19:13] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:13] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:13] PROBLEM - puppet last run on elastic2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:15] expected [11:19:23] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:20:49] !log upgrading app servers mw1238-mw1258 to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] !log Sanitize lfnwiki - T183566 [11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:36] T183566: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566 [11:24:13] !log Run check_private_data on labsdb - T183566 [11:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:34] PROBLEM - DPKG on mw1246 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:33] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:34] RECOVERY - DPKG on mw1246 is OK: All packages OK [11:26:03] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:26:23] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 77207 bytes in 0.096 second response time [11:27:08] ^ downtime expired, all fine [11:28:23] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg],Service[hhvm] [11:39:29] !log upgrading eqiad video scalers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [11:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:44:11] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [11:44:20] RECOVERY - puppet last run on elastic2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:44:20] RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:44:21] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:01] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:20] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:30] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:30] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:46:40] RECOVERY - puppet last run on mc2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:46:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4142485 (10MoritzMuehlenhoff) It's also still in puppet, BTW: jmm@sarin:~$ sudo cumin lawr* 1 hosts will be targeted: lawrencium.eqiad.wmnet DRY... 
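For readers who haven't met the tool Moritz quotes just above: cumin selects hosts by pattern and optionally runs a command on them; invoked with only a host expression it stays in dry-run mode and just reports what it would target, which is the truncated `DRY...` output in his paste. A minimal sketch (the `uptime` command is an arbitrary illustration):
```
sudo cumin 'lawr*'            # selection only: "1 hosts will be targeted: lawrencium.eqiad.wmnet", then dry-run abort
sudo cumin 'lawr*' 'uptime'   # actually run a command on every matched host
```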
[11:47:00] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on ms-be2034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:47:01] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:47:40] RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:47:41] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:47:41] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:48:10] RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:48:51] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:48:55] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 [11:49:01] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:49:10] RECOVERY - puppet last run on ganeti2007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:49:10] RECOVERY - puppet last run on mw2268 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:50:12] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:51:24] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:51:38] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2076 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427645 (owner: 10Jcrespo) [11:52:20] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [11:54:10] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [11:55:54] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2076 (duration: 01m 16s) [11:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:31] 10Operations: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4142506 (10akosiaris) 05Open>03Resolved All VMs have been migrated to using `cache=none`. I'll resolve this successfully, hopefully we will not meet this issue again [12:21:20] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4142521 (10akosiaris) 05stalled>03Open With `cache=none` being set in all clusters for unrelated reasons, this is now unblocked. In the meantime `jessie-backports` has upgraded to `2.8`. Fortunately the changelog[...
[12:21:23] (03PS3) 10Matthias Mullie: Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:21:47] (03CR) 10Matthias Mullie: [C: 031] "LGTM, but lacking permissions to +2" [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:26:42] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142530 (10fgiunchedi) Somewhat frequent but not a lot, I don't have exact numbers tho {F17126546} [12:29:35] (03PS3) 10Sbisson: Make tilerator_storage_id to kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) [12:33:52] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142540 (10Gilles) Is the Apr 18 occurrence on that screenshot after the change was deployed? [12:39:48] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4142546 (10fgiunchedi) @mmodell for sure, I was reading the log and I wonder why architecture changed from all to any? [12:40:06] (03PS1) 10Jcrespo: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 [12:40:27] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (10BBlack) >>! In T191996#4139205, @Marostegui wrote: > For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: > ```... [12:40:48] gilles: I'll forward you the cron emails if that's ok? looks simpler to me [12:41:05] godog: sure [12:43:16] of course gmail doesn't allow forwarding conversations )o) [12:43:20] mass-forward that is [12:43:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:44:52] (03Merged) 10jenkins-bot: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:44:58] gilles: {{done}} [12:45:02] thanks [12:46:48] Date: Tue, Apr 17, 2018 at 11:31 AM - which TZ is that? [12:47:00] godog: ^ [12:47:30] ah, nevermind, there are later occurrences [12:48:27] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2075 (duration: 01m 16s) [12:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:37] gilles: looks like my browser's, so CET/CEST [12:48:53] (03CR) 10jenkins-bot: mariadb: Depool db2075 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427649 (owner: 10Jcrespo) [12:49:39] 10Operations, 10monitoring: Improve remote IPMI monitoring - https://phabricator.wikimedia.org/T192547#4142570 (10Volans) [12:49:52] 10Operations, 10monitoring: Improve remote IPMI monitoring - https://phabricator.wikimedia.org/T192547#4142580 (10Volans) p:05Triage>03Normal [12:54:54] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142590 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2003.codfw.wmnet']...
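The irq-balance claim BBlack quotes at 12:40 ("the irq for eno1 is balanced across CPUs") is checkable from any shell on the host; a sketch, with the IRQ number invented for illustration:
```
grep eno1 /proc/interrupts           # one counter column per CPU shows how the NIC's interrupts are spread
cat /proc/irq/60/smp_affinity_list   # which CPUs IRQ 60 (hypothetical number) is allowed to fire on
```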
[12:58:45] !log starting reimage of db2075 [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:02:53] I'm here [13:06:13] anybody to swat? [13:07:51] I can if needed [13:08:07] Reedy, well, nobody else's here apparently :) [13:08:39] (03PS2) 10Reedy: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:08:46] (03CR) 10Reedy: [C: 032] Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:08] (03Merged) 10jenkins-bot: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:22] (03CR) 10jenkins-bot: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [13:10:27] (03PS4) 10Reedy: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:10:39] (03CR) 10Reedy: [C: 032] Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:11:57] (03Merged) 10jenkins-bot: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:13:00] No namespace dupes either [13:13:25] Reedy, you mean...no need for the script? [13:13:29] Yup [13:13:32] 0 conflicts [13:13:46] that's good, isn't it :) [13:14:55] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: T192427 T189277 (duration: 01m 17s) [13:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:02] T192427: Enable $wgUseRCPatrol in hiwikiversity - https://phabricator.wikimedia.org/T192427 [13:15:02] T189277: Change aliases on ruwiki - https://phabricator.wikimedia.org/T189277 [13:15:19] thank you for the deploy Reedy ! [13:15:40] np [13:15:41] (03CR) 10jenkins-bot: Change NS aliases on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418070 (https://phabricator.wikimedia.org/T189277) (owner: 10Framawiki) [13:16:07] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4142631 (10Gilles) Looking at the content of the emails @fgiunchedi forwarded to me, the last offense is right when the next cron kicks in after the restart and keeps going f... [13:18:33] (03CR) 10Ottomata: "Yar, I'm thinking this role class based targeting is not the best way to do this. Pretty fragile and disconnected."
[puppet] - 10https://gerrit.wikimedia.org/r/427596 (owner: 10Elukey) [13:20:18] (03PS1) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [13:20:52] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4142646 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2003.codfw.wmnet'] ``` and were **ALL** successful. [13:21:05] (03CR) 10jerkins-bot: [V: 04-1] base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:22:05] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4142647 (10MoritzMuehlenhoff) Or we could upgrade the Ganeti cluster to stretch? It provides qemu 2.8 out of the box. [13:23:10] (03CR) 10Filippo Giunchedi: "List of currently-affected hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:25:38] (03PS2) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [13:30:13] !log upgrading mw1334-mw1337 (job runners) to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [13:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] !log reindexing serbian wikis on elastic@eqiad (T189265) [13:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:51] T189265: Re-index Serbian Wikis - https://phabricator.wikimedia.org/T189265 [13:33:50] !log Start atop on db1114 - T191996 [13:33:54] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142669 (10Marostegui) >>! In T191996#4142547, @BBlack wrote: > > Not that it's probably the issue here, but this probably isn't ideal. If you look at `grep eno1 /proc/in... [13:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:56] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [13:35:32] (03PS1) 10Elukey: Release 0.10.1-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) [13:37:55] (03PS2) 10Elukey: Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) [13:39:40] !log Stop atop on db1114 - T191996 [13:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:46] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [13:40:33] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142677 (10Marostegui) 05Open>03Resolved a:03Marostegui So, as soon as I started atop, errors came back and packets dropped. So the culprit is clearly `atop`. I am go... 
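"The script" Urbanecm asks Reedy about after the ruwiki namespace-alias change (13:13) is presumably MediaWiki's namespaceDupes.php maintenance script, which finds pages stranded when namespaces or aliases change; with 0 conflicts there was nothing for it to do. A sketch of the usual invocation via the mwscript wrapper that appears later in this log:
```
mwscript namespaceDupes.php --wiki=ruwiki          # report conflicting titles only
mwscript namespaceDupes.php --wiki=ruwiki --fix    # move/repair the conflicts it finds
```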
[13:42:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 [13:43:13] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142681 (10Marostegui) [13:44:21] (03PS1) 10Jcrespo: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 [13:45:13] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:46:28] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:48:49] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2075 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427659 (owner: 10Jcrespo) [13:50:06] (03CR) 10Muehlenhoff: [C: 032] Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [13:50:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 (owner: 10Jcrespo) [13:50:37] (03PS2) 10Jcrespo: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 [13:53:03] (03PS2) 10Muehlenhoff: Remove rendering from lvs::configuration::lvs_service_ips for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/427356 [13:53:19] (03PS1) 10Gilles: Xenon: don’t generate SVGs for recently modified logs [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) [13:55:12] (03CR) 10Muehlenhoff: [C: 032] Remove rendering from lvs::configuration::lvs_service_ips for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/427356 (owner: 10Muehlenhoff) [13:56:48] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2075, depool db2074 (duration: 01m 16s) [13:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:44] (03CR) 10jenkins-bot: mariadb: Depool db2074 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427661 (owner: 10Jcrespo) [14:00:38] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (10Marostegui) [14:01:05] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142739 (10Marostegui) [14:01:09] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129002 (10Marostegui) [14:01:10] (03PS2) 10Gehel: maps: remove sources.yaml [puppet] - 10https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) [14:02:44] (03CR) 10Ottomata: [C: 031] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [14:03:54] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142741 (10jcrespo) If I have to guess, I would say it is the combination of the stretch version + high load (if it is network, cpu or io, I cannot say) - I think enwiki API are hosts with lots of ongoing...
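The "alert on SMART health failure" change godog iterates on above (13:20-13:25) presumably surfaces the drive's own health verdict through Icinga; what that check looks like by hand, with the device path illustrative:
```
sudo smartctl -H /dev/sda   # prints e.g. "SMART overall-health self-assessment test result: PASSED"
sudo smartctl -A /dev/sda   # full attribute table: reallocated sectors, pending sectors, etc.
```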
[14:04:01] (03CR) 10Gehel: [C: 032] maps: remove sources.yaml [puppet] - 10https://gerrit.wikimedia.org/r/423721 (https://phabricator.wikimedia.org/T112948) (owner: 10Gehel) [14:04:18] (03CR) 10Muehlenhoff: [C: 032] Rebuild for Stretch as tidy-0.99 [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: 10Hashar) [14:08:25] (03PS1) 10Dzahn: releases-parsoid: fix directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:09:03] (03PS2) 10Gehel: maps: cleanup of sources.yaml code [puppet] - 10https://gerrit.wikimedia.org/r/423722 (https://phabricator.wikimedia.org/T112948) [14:09:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 [14:11:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:12:04] !log starting reimage of db2074 [14:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:13:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427669 (owner: 10Marostegui) [14:13:41] (03CR) 10Gehel: [C: 032] "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/10980/" [puppet] - 10https://gerrit.wikimedia.org/r/423722 (https://phabricator.wikimedia.org/T112948) (owner: 10Gehel) [14:14:34] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142754 (10BBlack) When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like: ``` root 26337 1 0 00:00 ? 00:00:04 /usr/bin/atop -a -... [14:14:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113:3315 after alter table (duration: 01m 16s) [14:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:57] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#4142755 (10MoritzMuehlenhoff) 05Open>03Resolved All traces of the image scalers are gone. There's some additional puppet refactoring to be done, but unrel... [14:15:59] (03PS2) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:16:20] (03PS1) 10Muehlenhoff: Use a WMF-specific version number, not one from Debian backports [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/427670 [14:16:43] (03PS3) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:17:08] (03CR) 10Dzahn: "yep, sorry about that and thanks for reverting. 
i merged and something happened in RL that distracted me" [puppet] - 10https://gerrit.wikimedia.org/r/427594 (owner: 10Giuseppe Lavagetto) [14:17:22] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142762 (10Marostegui) [14:17:50] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (10Marostegui) @BBlack in the case of db1114 atop was normally running without causing any issues, but every 10 minutes it would spike for like 2-3 seconds using lots of the cores to their 100% (T1... [14:18:30] (03PS4) 10Dzahn: releases-parsoid: add directory and fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) [14:18:42] (03CR) 10Dzahn: [C: 032] "re-revert of https://gerrit.wikimedia.org/r/#/c/427594/" [puppet] - 10https://gerrit.wikimedia.org/r/427668 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:20:12] (03CR) 10Dzahn: "fixed in https://gerrit.wikimedia.org/r/#/c/427668/" [puppet] - 10https://gerrit.wikimedia.org/r/427594 (owner: 10Giuseppe Lavagetto) [14:22:51] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142770 (10faidon) >>! In T136732#4139610, @Ottomata wrote: > We could do that, but we wanted something centralized and reproducible (e.g. include a pu... [14:24:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) [14:24:47] (03PS1) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [14:25:33] (03PS4) 10Gehel: Make tilerator_storage_id to kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [14:25:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:26:20] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4142789 (10Dzahn) - https://releases.wikimedia.org/parsoid/ has been create...
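If the HDFS route floated in the T136732 thread above wins out, the "archive old MaxMind databases" job could reduce to something like the sketch below; every path and filename here is illustrative, not what was actually deployed:
```
DATE=$(date +%Y-%m-%d)
hdfs dfs -mkdir -p "/wmf/data/archive/geoip/${DATE}"
hdfs dfs -put /usr/share/GeoIP/GeoIP2-City.mmdb "/wmf/data/archive/geoip/${DATE}/"
```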
[14:27:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:28:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 for alter table (duration: 01m 13s) [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427671 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [14:29:03] (03CR) 10Muehlenhoff: [C: 032] Use a WMF-specific version number, not one from Debian backports [debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/427670 (owner: 10Muehlenhoff) [14:29:03] !log Deploy schema change on db1082 (this will generate lag on s5 on labs hosts) - T191519 T188299 T190148 [14:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [14:29:11] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [14:29:11] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [14:29:51] (03CR) 10Gehel: [C: 032] "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10981/" [puppet] - 10https://gerrit.wikimedia.org/r/427631 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [14:30:00] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142808 (10Ottomata) I don't have much context of how geowiki runs, but storing this in HDFS would be fine. We (I?) just thought it would be better to... [14:30:09] !log Start atop on db1114 without "-R" - T192551 [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] T192551: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 [14:34:00] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142814 (10Marostegui) So atop is now running on db1114 like: ``` root 30566 0.0 0.0 24712 7780 ? S […] (03CR) 10Andrew Bogott: [C: 031] base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:36:54] (03PS2) 10Gehel: wdqs: tune performance limits for the new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) [14:37:21] (03PS1) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:37:59] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:38:11] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4142823 (10fdans) Got it, yeah uploading to HDFS seems pretty sensible. The only documented application for this archive is history reconstruction, so...
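The test !logged at 14:30 ("Start atop on db1114 without "-R"") compares the daemon's behaviour with and without that extra flag, against the jessie-style invocation BBlack pasted at 14:14. Roughly, with the log path and 600-second interval matching the Debian packaging defaults:
```
sudo systemctl stop atop
sudo /usr/bin/atop -a -w /var/log/atop/atop_$(date +%Y%m%d) 600 &   # same raw-log writer, minus -R
ps -ef | grep '[a]top'   # confirm which flags the running instance actually carries
```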
[14:42:07] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4142839 (10Vgutierrez) p:05Triage>03Normal [14:42:14] (03PS2) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:42:21] (03CR) 10Gehel: [C: 032] "puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler03/10984/" [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:42:47] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:42:51] (03PS3) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:43:20] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:44:52] (03PS4) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:45:34] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:46:00] <_joe_> mutante: I did revert your change from yesterday as it was making puppet fail on releases* [14:46:07] <_joe_> the one introducing releases-parsoid [14:46:12] (03PS5) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [14:46:43] (03CR) 10jerkins-bot: [V: 04-1] releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [14:47:16] !log Create bureaucrat account for [[User:Anderi Store]] on romd.wikimedia (T187184) [14:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:22] T187184: WMF-hosted wiki request for Ro-Md Wikimedians user group - https://phabricator.wikimedia.org/T187184 [14:48:38] !log Erratum: read "[[User:Andrei Stroe]]" and not "[[User:Anderi Store]]" for the previous entry (T187184) [14:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] 10Operations, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284850 (10elukey) [14:49:13] was there any problem with irc.wikimedia.org today? 
[14:49:21] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#4142865 (10elukey) ping - status :) [14:53:06] (03PS1) 10Ottomata: Add certificates for kafka_test_broker and kafka_main-deployment-prep_broker [labs/private] - 10https://gerrit.wikimedia.org/r/427676 (https://phabricator.wikimedia.org/T167039) [14:56:50] (03CR) 10Ottomata: [V: 032 C: 032] Add certificates for kafka_test_broker and kafka_main-deployment-prep_broker [labs/private] - 10https://gerrit.wikimedia.org/r/427676 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:05:47] (03PS1) 10Elukey: Reimage analytics1069 to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/427679 (https://phabricator.wikimedia.org/T192557) [15:07:40] (03CR) 10Elukey: [C: 032] Reimage analytics1069 to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/427679 (https://phabricator.wikimedia.org/T192557) (owner: 10Elukey) [15:08:48] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4142938 (10JBennett) What's the data? From our clicktracking efforts what will we be collecting? [15:09:53] (03PS1) 10Ottomata: Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) [15:09:59] (03PS1) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:10:22] (03CR) 10jerkins-bot: [V: 04-1] Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:10:26] (03CR) 10jerkins-bot: [V: 04-1] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:10:30] 10Operations, 10Traffic, 10Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142948 (10Vgutierrez) [15:10:58] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:11:38] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:11:39] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
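A "degraded" systemd state like the restbase1016 alerts above means one or more units sit in the failed state; the usual first-response sequence, with the unit name taken from the alert itself:
```
systemctl --failed                       # which unit(s) dragged the state down
sudo journalctl -u cassandra-a -n 100    # recent log lines for the failed instance
sudo systemctl restart cassandra-a       # once the cause is understood
```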
[15:11:41] !log sbisson@tin Started deploy [kartotherian/deploy@74121d5]: Deploy latest kartotherian with new i18n sources [15:11:45] (03PS2) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:25] (03CR) 10Ottomata: "No op in prod" [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:12:31] (03CR) 10Ottomata: [V: 032 C: 032] Temporarily look up main kafka cluster name for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/427680 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:13:13] 10Operations, 10Traffic, 10Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142965 (10Vgutierrez) p:05Triage>03Normal [15:15:39] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4142985 (10thcipriani) [15:16:59] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4142999 (10thcipriani) This would also give us a place to test various mwscripts used by scap with php7 [15:16:59] !log sbisson@tin Finished deploy [kartotherian/deploy@74121d5]: Deploy latest kartotherian with new i18n sources (duration: 05m 19s) [15:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4143013 (10thcipriani) [15:17:14] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4143012 (10thcipriani) [15:17:28] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused [15:20:28] _joe_: yes, i saw. thank you. i merged and got distracted. sorry about that. it's fixed now [15:21:06] (03PS1) 10Muehlenhoff: Add component/ci for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/427683 (https://phabricator.wikimedia.org/T191771) [15:21:10] scap question, where can I find a syntax reference for j2 templates? 
[15:22:54] (03PS1) 10ArielGlenn: keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) [15:23:00] (03CR) 10Muehlenhoff: [C: 032] Add component/ci for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/427683 (https://phabricator.wikimedia.org/T191771) (owner: 10Muehlenhoff) [15:23:13] (03CR) 10jerkins-bot: [V: 04-1] keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [15:24:37] (03CR) 10Filippo Giunchedi: [C: 031] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:25:56] nevermind, I just saw the erb syntax config [15:27:00] (03CR) 10Muehlenhoff: "Note that this won't upgrade existing stretch systems, so please upgrade these after rolling out the patch so that we have it in sync acro" [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:28:01] (03PS6) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [15:31:15] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active [15:31:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 [15:31:46] (03CR) 10Herron: [C: 032] mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:31:55] (03CR) 10Herron: [C: 032] "will do!" [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:31:55] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [15:32:01] (03PS3) 10Herron: mtail: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/427681 (https://phabricator.wikimedia.org/T175361) [15:32:26] 10Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#4143046 (10akosiaris) >>! In T150532#4142647, @MoritzMuehlenhoff wrote: > Or we could upgrade the Ganeti cluster to stretch? It provides qemu 2.8 out of the box. I'd rather not couple the 2 upgrades. Both need to b...
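On the template question at 15:21 (answered at 15:25): scap's config templates use ERB, not Jinja2, so placeholders follow Ruby's templating syntax. A minimal sketch; the file name and variables are invented:
```
# ERB placeholders, as they would appear in a hypothetical templates/service-config.yaml.erb:
#   log_level: <%= log_level %>                 # <%= expr %> interpolates a value
#   <% if enable_debug %>debug: true<% end %>   # <% code %> evaluates without emitting output
```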
[15:33:05] (03PS1) 10Andrew Bogott: labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) [15:33:34] (03PS2) 10Andrew Bogott: labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) [15:33:43] !log sbisson@tin Started deploy [kartotherian/deploy@89c4ca9]: Deploy latest kartotherian with new i18n sources (take 2) [15:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:56] RECOVERY - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-a valid until 2018-08-17 16:11:26 +0000 (expires in 120 days) [15:36:35] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.004 second response time on 10.64.0.32 port 9042 [15:36:48] !log sbisson@tin Finished deploy [kartotherian/deploy@89c4ca9]: Deploy latest kartotherian with new i18n sources (take 2) (duration: 03m 05s) [15:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:14] !log sbisson@tin Started deploy [kartotherian/deploy@0a5a3ef]: Deploy latest kartotherian with new i18n sources (take 3) [15:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:37] (03CR) 10Andrew Bogott: [C: 032] labtestwikitech: add grants for labtestwiki, now on m5 [puppet] - 10https://gerrit.wikimedia.org/r/427688 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [15:38:05] PROBLEM - cassandra-b service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:38:26] PROBLEM - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:38:45] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [15:38:45] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:39:14] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:40:30] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:40:44] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2074 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427686 (owner: 10Jcrespo) [15:42:37] !log sbisson@tin Finished deploy [kartotherian/deploy@0a5a3ef]: Deploy latest kartotherian with new i18n sources (take 3) (duration: 05m 22s) [15:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] (03PS1) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:43:31] (03PS1) 10Gehel: maps: disable OSM replication during tile regeneration [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) [15:44:02] (03CR) 10jerkins-bot: [V: 04-1] Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [15:44:06] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143070 (10mepps) @Dzahn I'm looking for access to Pivot (especially https://pivot.wikimedia.org/#banner_activity_minutely and https://pivot.wikimedia.org/#pageviews-hourly), SWAP (... [15:44:09] !log fdans@tin Started deploy [analytics/refinery@5d0f63f]: deploying to launch page preview job [15:44:13] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2074 (duration: 01m 17s) [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:06] (03PS2) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:45:23] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143094 (10mepps) We want access for all of fr-tech actually for these purposes: https://phabricator.wikimedia.org/T181629 [15:45:23] (03PS3) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [15:46:29] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 [15:47:45] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [15:47:59] !log Deploy schema change on dbstore1002 (s5) - T191519 T188299 T190148 [15:48:01] (03CR) 10Sbisson: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [15:48:05] RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:06] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [15:48:06] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [15:48:06] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [15:49:06] PROBLEM - etcd request latencies on neon is CRITICAL: 9.41e+04 ge 5e+04
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:13] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:49:26] PROBLEM - Request latencies on neon is CRITICAL: 1.234e+05 ge 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:46] PROBLEM - etcd request latencies on chlorine is CRITICAL: 1.223e+05 ge 5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:50:31] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:50:43] !log fdans@tin Finished deploy [analytics/refinery@5d0f63f]: deploying to launch page preview job (duration: 06m 34s) [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 after alter table (duration: 01m 16s) [15:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:30] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427693 (owner: 10Marostegui) [15:52:47] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4143132 (10CCogdill_WMF) We're collecting click engagement off fundraising emails (actual fundraising appeals, or informational newsletter emails) that... [15:53:35] RECOVERY - Request latencies on neon is OK: (C)1e+05 ge (W)5e+04 ge 2.088e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:53:55] RECOVERY - etcd request latencies on chlorine is OK: (C)5e+04 ge (W)3e+04 ge 4320 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:54:15] RECOVERY - etcd request latencies on neon is OK: (C)5e+04 ge (W)3e+04 ge 3208 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:58:22] (03PS2) 10Gehel: maps: disable OSM replication during tile regeneration [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) [15:58:44] (03CR) 10Gehel: [C: 032] "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/10986/" [puppet] - 10https://gerrit.wikimedia.org/r/427691 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [15:59:35] (03PS4) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [16:00:05] godog, moritzm, and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
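Every depool/repool cycle in this window (db2076, db2075, db2074, db1113:3315, db1082) has the same two-step shape: flip the replica's weight in mediawiki-config, then sync that single file from the deploy host. A sketch with invented hostnames and weights, not the real s5 topology:
```
# wmf-config/db-eqiad.php, fragment of the section loads:
#   's5' => [
#       'db1070' => 0,     # master (weight 0; invented name)
#       'db1082' => 100,   # set to 0 or comment out to depool
#   ],
# then, on the deploy host:
scap sync-file wmf-config/db-eqiad.php 'Depool db1082 for alter table'
```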
[16:04:20] (03PS2) 10ArielGlenn: keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) [16:05:42] RECOVERY - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-b valid until 2018-08-17 16:11:09 +0000 (expires in 120 days) [16:06:52] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.118 port 9042 [16:07:41] (03CR) 10Filippo Giunchedi: Target kafka jmx exporters by profiles instead of roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:08:16] (03PS5) 10Andrew Bogott: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 [16:12:41] (03CR) 10Ottomata: Target kafka jmx exporters by profiles instead of roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:12:43] (03PS2) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [16:12:52] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:13:12] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:13:22] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [16:13:23] PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:13:43] (03CR) 10Ottomata: "BTW, this will help avoid bugs like https://gerrit.wikimedia.org/r/#/c/427596/" [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:15:16] (03PS3) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [16:16:12] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [16:16:23] RECOVERY - cassandra-c service on restbase1016 is OK: OK - cassandra-c is active [16:16:51] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, IMO better to use find in this case" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [16:20:33] !log shutting down tilerator on maps[12].* for maintenance - T191655 [16:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:39] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [16:20:53] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2018-08-17 16:11:29 +0000 (expires in 119 days) [16:21:22] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.34 port 9042 [16:22:12] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:22:12] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [16:22:23] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:22:23] PROBLEM - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:24:12] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:00] !log restarting tilerator on maps[12].* - T191655 [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:06] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [16:28:17] (03CR) 10Muehlenhoff: [C: 031] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [16:28:55] (03CR) 10Filippo Giunchedi: "LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:29:37] (03CR) 10Filippo Giunchedi: "> LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [16:31:59] (03PS1) 10Elukey: Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557) [16:33:12] 41 hosts to go, will be looong... [16:34:14] *pfft*, we have about 250 mw* servers to reimage :-) [16:34:53] * elukey cries in a corner [16:34:59] automate all the things! [16:35:35] * elukey asks to Reedy some mercy [16:35:37] :D [16:36:16] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4143289 (10Gehel) >>! In T185504#4131278, @Dzahn wrote: > `ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", S... [16:37:32] moritzm: let me know if I can help with the mw reimages, I can schedule some time during the next weeks to work on it [16:43:14] 10Operations, 10Puppet: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4143318 (10herron) a:03herron That's odd, will spin up a test instance and attempt to reproduce [16:45:32] RECOVERY - Check systemd state on restbase1007 is OK: OK - running: The system is fully operational [16:46:13] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active [16:46:50] (03CR) 10Krinkle: [C: 031] "LGTM, and confirmed by running `php -S localhost:34343` and checking via http://localhost:34343/docroot/noc/db.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [16:49:22] elukey: it's fine, large scale reimages can start when we've fully sorted out the memcached situation, right now I'm mostly rolling out some systems to catch potential regressions [16:50:14] !log uploaded tidy-0.99 to component/ci for apt.wikimedia.org/stretch-wikimedia (T191771) [16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:20] T191771: [REL1_30] Some parserTests fail on debian stretch using Tidy, because of a new version of libtidy - https://phabricator.wikimedia.org/T191771 [16:50:41] moritzm: ack [16:52:13] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on 10.64.0.230 port 9042 [16:52:33] RECOVERY - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is OK: SSL OK - Certificate restbase1007-a valid until 2018-08-17 16:10:53 +0000 (expires in 119 days) [16:59:00] (03PS7) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 
(https://phabricator.wikimedia.org/T150672) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:55] (03PS8) 10Dzahn: releases-parsoid: setup rsync between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) [17:01:14] (03CR) 10Dzahn: [C: 032] "+ auto_sync cron job" [puppet] - 10https://gerrit.wikimedia.org/r/427674 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [17:02:33] (03CR) 10Cmjohnson: [C: 032] Adding dns for db1116-1123 [dns] - 10https://gerrit.wikimedia.org/r/427536 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [17:02:35] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3065109 (10Imarlier) @fgiunchedi Mentioned in #wikimedia-perf that he thought he remembered there being a reason why submitting metrics via graphite wouldn't work. First, her... [17:05:58] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4143465 (10Dzahn) In Hiera it is defined which is the currently "active" rel... [17:06:31] (03PS1) 10Herron: install_server: reinstall mx2001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) [17:07:33] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4143477 (10Dzahn) @ssastry I think this should have resolved the ticket. See... 
[17:07:38] (03CR) 10Herron: [C: 04-2] "not to be merged until mx2001 is depooled" [puppet] - 10https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [17:13:43] (03CR) 10Andrew Bogott: [C: 032] Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:15:07] (03Merged) 10jenkins-bot: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:18:05] !log andrew@tin Synchronized docroot/noc/db.php: Moving labtestwikitech to m5, step 1 (duration: 01m 16s) [17:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] (03CR) 10jenkins-bot: Move labtestwikitech from a local db to m5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427690 (owner: 10Andrew Bogott) [17:20:13] !log andrew@tin Synchronized wmf-config/db-codfw.php: Moving labtestwikitech to m5, step 2 (duration: 01m 16s) [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:44] !log andrew@tin Synchronized wmf-config/db-eqiad.php: Moving labtestwikitech to m5, step 3 (duration: 01m 16s) [17:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:50] (03CR) 10Elukey: [C: 032] Release 0.10.0-3~jessie [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/427657 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [17:33:31] (03PS1) 10Andrew Bogott: m5: allow labtestweb2001 mysql access [puppet] - 10https://gerrit.wikimedia.org/r/427720 (https://phabricator.wikimedia.org/T192339) [17:33:44] (03PS4) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [17:35:55] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4143586 (10herron) Ok, test mx instance is looking good. Will plan to depool and reinstall mx2001 with Stretch next week. @ayounsi could we coordinate a time to reject connections to mx200... [17:37:27] (03CR) 10Andrew Bogott: [C: 032] m5: allow labtestweb2001 mysql access [puppet] - 10https://gerrit.wikimedia.org/r/427720 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [17:42:52] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4143632 (10Andrew) [17:45:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 280.83 seconds [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:11] * thcipriani uses window to catch-up train [18:11:09] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143711 (10Nuria) @meeps: please note that the ticket you are linking to is also a request for access where I noted that these tools require different access levels, ALL of them giv... [18:16:10] AndyRussG: so I just looked at tin and saw that https://gerrit.wikimedia.org/r/#/c/427235/ was fetched for wmf.30 but not checked out [18:16:40] thcipriani: hi! 
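[editor's note] The labtestwikitech move synced above in three steps touches the database-section mapping in wmf-config. As a rough, hedged illustration only (this is not the content of gerrit:427690; the hosts and load weights below are invented), re-homing a wiki in an LBFactoryMulti-style configuration generally comes down to one 'sectionsByDB' entry:

<?php
// Hypothetical sketch of an LBFactoryMulti config; everything concrete
// here (host names, weights) is made up for illustration.
$wgLBFactoryConf = [
    'class' => 'LBFactoryMulti',

    // Wikis listed here are served by the named section; anything not
    // listed falls through to DEFAULT. Adding this one entry is what
    // "moves" labtestwikitech from its old section onto m5.
    'sectionsByDB' => [
        'labtestwikitech' => 'm5',
    ],

    // Per-section host => load maps; by convention the zero-weight
    // first entry is the master and the rest are replicas.
    'sectionLoads' => [
        'DEFAULT' => [ 'db1001' => 0, 'db1002' => 200 ],
        'm5'      => [ 'db1009' => 0, 'db1010' => 100 ],
    ],
];

Syncing noc's db.php first, then db-codfw.php, then db-eqiad.php (as done above) updates the public documentation and the passive datacenter before the active one.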
[18:16:42] one sec [18:16:46] ok [18:17:16] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4143719 (10ayounsi) [18:18:34] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4062522 (10ayounsi) Verified that external monitoring doesn't do ping checks (but http, etc. instead) to hostnames (en.wikipedia.org, etc). Added a Watchmouse ping check for... [18:18:38] thcipriani: the intention was that that patch, as well as https://gerrit.wikimedia.org/r/#/c/427439/, would go out during yesterday's morning SWAT [18:19:20] stuff happened, not related to any issues with those patches, and somehow stuff only made it to group0. I didn't follow what happened with the train [18:19:51] Both those can be pushed out everywhere with wmf.30, if it's not a bother [18:20:38] I can make them live now, but it doesn't look like they are currently live anywhere [18:20:46] do you want to test them? [18:20:58] thcipriani: they are, or were, on mediawiki.org yesterday [18:21:17] We did test the first (the one you mentioned) on prod, via mwdebug1002 [18:21:48] The second one can only be really tested on meta, which wasn't updated. But it's very, very minimal, so I'd recommend just going ahead and pushing it all out [18:22:11] hrm hold on one second, maybe the way this was fetched was just weird [18:22:41] Hmmm [18:22:45] Anyway, once the train has traversed its silicon track (and full scap is done) I can double-check that it's all good [18:25:25] AndyRussG: false alarm. So when I fetched down changes for mediawiki core for 1.31.0-wmf.30 the submodule bumps for CentralNotice came down, but I guess the changes for the actual extension were already fetched, just not the submodule bumps on core [18:25:58] tl;dr: git looked weird, but wmf.30 looks up-to-date on the appservers [18:32:19] testwiki is now broken? [18:32:25] https://test.wikipedia.org/wiki/File:Rxy.svg.png [18:32:46] [WtjgxgpAICsAAFIqsawAAADM] /wiki/File:Rxy.svg.png InvalidArgumentException from line 875 of /srv/mediawiki/php-1.31.0-wmf.30/includes/libs/rdbms/loadbalancer/LoadBalancer.php: No server with index 'Array'. [18:32:57] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4143748 (10mmodell) @fgiunchedi: I'm not sure, the commit says it's for py3 transition, but I'm not sure why it matters. @demon, can... [18:32:58] !log thcipriani@tin Synchronized php-1.31.0-wmf.30/resources/src/jquery: [[gerrit:427709|jquery.makeCollapsible: Only add "[" "]" to autogenerated toggles]] T192140 (duration: 01m 17s) [18:33:02] works for me [18:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:05] T192140: Square brackets shown around the expand/collapse icons on RC - https://phabricator.wikimedia.org/T192140 [18:33:17] hmm private mode is ok.. [18:35:50] https://phabricator.wikimedia.org/P7018 [18:36:44] probably the error occurs for the file uploader? [18:38:00] alt account is ok. [18:38:19] rxy still gets the error [18:38:46] rxy: could you file a task for that? I definitely see that error in the logs. [18:39:09] k... [18:45:37] https://phabricator.wikimedia.org/T192584 [18:45:44] thanks!
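[editor's note] The InvalidArgumentException reported above (filed as T192584) gets diagnosed a few lines below: Article.php handed a query-options array to a parameter that expects an integer server group. A self-contained sketch of that failure mode, assuming nothing beyond what is quoted in this log (MediaWiki's real constants live in Defines.php and the real validation in rdbms/loadbalancer/LoadBalancer.php):

<?php
// Stand-ins for MediaWiki's server-group constants.
const DB_REPLICA = -1;
const DB_MASTER = -2;

// Stand-in for the LoadBalancer lookup that threw on testwiki.
function getConnection( $index ) {
    if ( !is_int( $index ) ) {
        // An array rendered as a string becomes the literal "Array",
        // which is exactly the error text rxy pasted above.
        $shown = is_array( $index ) ? 'Array' : (string)$index;
        throw new InvalidArgumentException( "No server with index '$shown'." );
    }
    return "connection for server group $index";
}

// Buggy call shape: Article.php passed [ 'USE INDEX' => 'rc_timestamp' ]
// as the third argument of RecentChange::newFromConds(), which is
// $dbType = DB_REPLICA, so the options array ended up here:
try {
    getConnection( [ 'USE INDEX' => 'rc_timestamp' ] );
} catch ( InvalidArgumentException $e ) {
    echo $e->getMessage(), "\n"; // No server with index 'Array'.
}

// The fix (gerrit:427755, "Do not pass USE INDEX to a $dbType
// parameter") drops the stray argument so the default applies again:
echo getConnection( DB_REPLICA ), "\n";

Per the discussion below, the broken call had apparently sat in place for years and only started firing once the surrounding patrolling configuration changed (T184791).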
[18:45:49] thanks too [18:51:00] added as a blocker for 1.31.0-wmf.30 train rollout [18:51:25] thx :) [18:51:37] thcipriani: I've found it [18:51:38] public static function newFromConds( [18:51:38] $conds, [18:51:38] $fname = __METHOD__, [18:51:39] $dbType = DB_REPLICA [18:51:41] ) { [18:51:46] in Article.php [18:51:48] [ 'USE INDEX' => 'rc_timestamp' ] [18:52:30] https://github.com/wikimedia/mediawiki/blame/master/includes/page/Article.php#L1069 [18:52:37] I guess RecentChange has changed [18:53:35] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4143821 (10RStallman-legalteam) Tim's NDA is fully signed and on file with legal. Thanks! [18:57:37] It kinda looks like the code hasn't changed recently [18:59:25] hmm same error at mediawiki.org too [18:59:34] is this just a longstanding bug that's surfacing just now? [18:59:54] It could be, yup [19:00:04] thcipriani: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:26] so far my horoscope is correct, I guess. [19:00:45] sorry for finding this out... or... is it good timing? [19:02:18] https://gerrit.wikimedia.org/r/#/c/427751/ [19:02:24] That'll fix it... [19:02:50] It's almost like a parameter has been removed from RecentChange::newFromConds [19:03:27] thcipriani: Unless something has changed with patrolling config recently [19:03:38] That'd probably be the more likely answer [19:05:56] "something has changed with patrolling config recently"-> https://phabricator.wikimedia.org/T184791 > [19:05:59] ? [19:08:07] thcipriani: https://gerrit.wikimedia.org/r/427755 want to try that on .30? [19:09:36] Reedy: sure [19:09:58] Looks very likely that's a bug that's been sitting there for yeaaaars [19:11:02] interesting [19:15:41] (03PS2) 10Dzahn: Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:16:08] (03CR) 10Dzahn: [C: 032] Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:17:14] (03PS7) 10Dzahn: Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [19:18:56] (03CR) 10Dzahn: [C: 032] Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045) (owner: 10Paladox) [19:19:00] thanks :) [19:19:16] Teamwork! [19:19:50] yep, paladox :) [19:19:56] :) [19:20:03] applied on gerrit2001 [19:20:20] no_justification wondering could you do that gerrit wmfcontent thing please :) [19:20:25] domain) [19:22:23] thcipriani: it merged [19:22:28] !log gerrit: restarting services to pick up gc & indexing changes [19:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:01] Reedy: just saw, but ^ gerrit is restarting so I can't fetch just now :) [19:23:06] lols [19:26:21] gerrit back [19:27:12] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:27:20] (03PS3) 10Dzahn: Gerrit: Move all logging to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [19:27:55] Reedy: ok, got it, pulled over to mwdebug1002 [19:28:00] anything to test there? [19:28:01] (03CR) 10Paladox: [C: 031] "I've been running this on gerrit-test3 for a long while 2+ weeks now. and has been working." [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [19:28:16] thcipriani: You can visit the file page that rxy linked and check it doesn't blow up [19:28:21] (it should work) [19:28:47] that will require another restart, as log4j 1.x does not reload like log4j2 does [19:32:03] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build] [19:32:03] Reedy: pretty sure that page always did work for me [19:32:09] orly? [19:32:11] Let me check [19:32:25] https://test.wikipedia.org/wiki/File:Rxy.svg.png broken normally... [19:32:34] good on mwdebug1002 [19:32:36] tested [19:32:37] SHIP IT [19:32:49] :) [19:32:50] k [19:33:16] Depending on how much longer .29 is hanging around can decide whether we merge to there too [19:34:51] hopefully wmf.29 will be gone today, but at the rate I'm going... [19:35:33] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4143951 (10AndyRussG) >>! In T192472#4143711, @Nuria wrote: > @meeps: please note that the ticket you are linking to is also a request for access where I noted that these tools requ... [19:35:47] !log thcipriani@tin Synchronized php-1.31.0-wmf.30/includes/page/Article.php: [[gerrit:427755|Do not pass USE INDEX to a $dbType parameter]] T192584 (duration: 01m 17s) [19:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] T192584: Error occurs in file page for Own uploaded files@1.31.0-wmf.30 (e8360e8) - https://phabricator.wikimedia.org/T192584 [19:36:44] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4143977 (10Eevans) Update: The decommission of restbase1010-c was discontinued after other instances in the rack began to fail... [19:38:26] (03PS1) 10Herron: puppetdb: add service enable => true [puppet] - 10https://gerrit.wikimedia.org/r/427772 (https://phabricator.wikimedia.org/T192531) [19:39:10] (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 [19:39:22] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4143992 (10herron) Sure enough, it's reproducible. Looks like the `systemd::service` entry used previously automatically set `enable => true` when called with `ensure => present` whic... [19:40:57] (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:41:26] Reedy: ok for now . 
thanks :) [19:42:11] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:45:07] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.30 [19:45:19] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144006 (10Pnorman) The answer in general is it depends. Are you looking for monitoring to diagnose problems, or alarms for health? I would recommend monitoring for maximum tra... [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:19] (03PS6) 10Paladox: Gerrit: Add url for avatars and setups gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) [19:49:17] (03CR) 10Hashar: "That requires the cumin master to use Stretch. On Jessie there are a bunch of apt / dependencies issues." [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [19:49:24] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427773 (owner: 10Thcipriani) [19:53:08] !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.30 (duration: 01m 15s) [19:53:13] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144043 (10Nuria) Access approved on my end. [19:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:02] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:12] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:03:35] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144066 (10Pnorman) Adding to the above, I would say that most of the other monitoring that can be done can be broken down into performance related metrics, like transactions pe... [20:09:22] thcipriani: how is the train going? [20:09:55] hashar: group0 + group1 done. Will roll forward all wikis shortly. [20:10:10] thcipriani: I will bring back quibble after that I guess [20:10:40] hashar: ok, I'll ping you when I'm done. [20:13:27] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144083 (10mepps) Thanks @nuria and @AndyRussG! Is there a next step @Dzahn @Reedy? [20:16:05] (03PS1) 10Andrew Bogott: Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 (https://phabricator.wikimedia.org/T189542) [20:20:09] (03Abandoned) 10Gilles: Xenon: don’t generate SVGs for recently modified logs [puppet] - 10https://gerrit.wikimedia.org/r/427665 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [20:21:12] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144102 (10Gilles) Hah, turns out I actually didn't ever truly run xenon-generate-svgs locally on that file, when I did it failed just like in produc...
[20:22:45] !log milimetric@tin Started deploy [analytics/refinery@c1c9885]: Correcting hql from last deployment [20:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] (03PS1) 10Thcipriani: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 [20:27:54] !log milimetric@tin Finished deploy [analytics/refinery@c1c9885]: Correcting hql from last deployment (duration: 05m 09s) [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] PROBLEM - etcd request latencies on chlorine is CRITICAL: 5.329e+04 ge 5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:29:53] (03CR) 10Thcipriani: [C: 032] All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:31:07] (03Merged) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:31:32] RECOVERY - etcd request latencies on chlorine is OK: (C)5e+04 ge (W)3e+04 ge 3796 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:38] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.30 [20:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] hashar: train is done [20:38:39] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144203 (10Gilles) OK, the error points to the last line, because that's where the file cursor is, but the offending line happened earlier. It's this... [20:40:46] (03CR) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427815 (owner: 10Thcipriani) [20:43:42] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:48:12] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter -- T189822, T192456 [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:19] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [20:48:20] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [20:48:24] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-a -- T189822, T192456 [20:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:12] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144267 (10Gilles) Finally found the offending line, which is just "1" and happened in the middle of the file. It's possible that it was written by t... [20:55:43] thcipriani: awesome. 
Bringing quibble back [21:00:19] !move issue move of enwiki_content shard 2 from overloaded elastic1027 to elastic1017 [21:00:23] !log issue move of enwiki_content shard 2 from overloaded elastic1027 to elastic1017 [21:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:00] (03PS1) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) [21:04:18] (03PS2) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) [21:07:52] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4144324 (10ayounsi) @herron Monday 10:30am PDT? (5:30pm UTC) How long will the block be installed for? [21:08:52] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:11:56] !log restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-c -- T189822, T192456 [21:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:05] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [21:12:05] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:15:54] !log Start cleanup, restbase10{07,11,16}-a -- T189822 [21:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:52] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4144383 (10Dzahn) @mepps Yea, the next step would be that we need a SSH key from you. Could you create one (https://wikitech.wikimedia.org/wiki/Production_shell_access#SSH_Key_Requi... [21:22:38] !log Start cleanup, restbase10{07,11,16}-b -- T189822 [21:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:45] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:25:54] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4144406 (10herron) >>! In T175361#4144324, @ayounsi wrote: > @herron Monday 10:30am PDT? (5:30pm UTC) > How long will the block be installed for? Sounds good! Barring any unexpected issues... [21:26:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:27:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:27:57] wikipedia is slow [21:28:31] https://en.wikipedia.org/wiki/IOS [21:28:33] is slow [21:29:30] did you try with https://en.wikipedia.org/wiki/Microsoft_Windows ?
:) [21:29:37] I use mac [21:29:45] yeh [21:29:48] i went to that page [21:29:56] and clicking edit is causing it to slowly load [21:30:04] I can confirm the loading issues here too [21:31:09] page loads fine here [21:31:30] Seems to work now. [21:32:09] yea, it seems to be one of those quirks that esams has had for quite a while now [21:33:46] ack, there was a short spike of more 5xx on the graph linked above [21:37:15] (03CR) 10Herron: [C: 032] puppetdb: add service enable => true [puppet] - 10https://gerrit.wikimedia.org/r/427772 (https://phabricator.wikimedia.org/T192531) (owner: 10Herron) [21:37:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:38:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:41:45] !log Start cleanup, restbase10{07,11,16}-c -- T189822 [21:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:52] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [21:45:07] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb does not start up on reboot - https://phabricator.wikimedia.org/T192531#4144420 (10herron) 05Open>03Resolved Fixed! ``` Notice: /Stage[main]/Puppetdb::App/Service[puppetdb]/enable: enable changed 'false' to 'true' ``` ``` UNIT FILE... [21:45:46] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144422 (10Dzahn) Thanks @Gehel and @pnorman! I would say let's start with this one: >>! In T185504#4143289, @Gehel wrote: > It looks like the script is trying to connect over... [21:48:18] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4144425 (10Krinkle) I don't know if the cited portion was changed unintentionally, but that sample does not show a `1` on its own line. It shows a `1... [21:50:26] (03CR) 10Krinkle: Filter out invalid records in xenon-log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [21:51:13] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4144436 (10Gehel) >>! In T185504#4144422, @Dzahn wrote: > Looking at the netbox module it seems /etc/postgresql/9.6/main/pg_hba.conf isn't puppetized while it does contain custo... [21:52:02] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4144438 (10MarcoAurelio) The error mentioned is gone, thanks. However we still have issues: T1... [22:03:52] mutante: I can confirm the 5xx spike on load.php as well [22:03:55] Do we know what caused it?
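[editor's note] On the xenon thread interleaved above (T169249, gerrit:427816 "Filter out invalid records in xenon-log"): flamegraph.pl consumes collapsed-stack records, one "frame1;frame2;... <count>" per line, and a malformed record (Gilles reports a bare "1"; Krinkle reads the sample differently) aborts SVG generation. The real change is a non-PHP script in the puppet repo; purely as an illustration of the filtering idea:

<?php
// Illustrative only, not the actual xenon-log patch. Copies stdin to
// stdout, dropping anything that is not "<stack> <integer count>",
// e.g. a stray "1" on a line of its own.
$in = fopen( 'php://stdin', 'r' );
while ( ( $line = fgets( $in ) ) !== false ) {
    if ( preg_match( '/^.+ \d+$/', rtrim( $line, "\n" ) ) ) {
        echo $line;
    }
}
fclose( $in );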
[22:05:20] about 10K failed requests in total over that 10min to load.php [22:06:07] (9min*60sec*19 503s/sec) [22:06:18] https://grafana.wikimedia.org/dashboard/db/resourceloader?refresh=5m&orgId=1&from=1524171141986&to=1524174448108 [22:06:25] It's 0 before and after that spike [22:07:51] logstash mediawiki-errors went from 4K/min to 20K/min and has not come down since [22:07:57] twentyafterfour: I think that means train rollback, right? [22:08:25] https://usercontent.irccloud-cdn.com/file/GJ4Kc29S/Screen%20Shot%202018-04-19%20at%2023.08.16.png [22:09:10] the logstash shape does seem to roughly take off same time as "Group1 to 1.31.0-wmf.30" [22:10:29] It seems 95% of it is this one: domain=127.0.0.1 url=runJobs.php channel=CirrusSearch message="Search backend error" [22:11:18] ebernhardson: might be related to your es changes? [22:11:33] https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors last 4 hours [22:12:29] (03PS4) 10BBlack: ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [22:14:57] (03CR) 10BBlack: [C: 032] ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [22:16:19] (03PS1) 10BBlack: fb traffic experiment: reduce to 1h [puppet] - 10https://gerrit.wikimedia.org/r/427821 [22:16:36] (03CR) 10BBlack: [V: 032 C: 032] fb traffic experiment: reduce to 1h [puppet] - 10https://gerrit.wikimedia.org/r/427821 (owner: 10BBlack) [22:18:25] (03PS4) 10BBlack: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - 10https://gerrit.wikimedia.org/r/426858 (owner: 10Ema) [22:19:00] (03CR) 10BBlack: [C: 032] Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - 10https://gerrit.wikimedia.org/r/426858 (owner: 10Ema) [22:25:38] I'm on train this week, I can roll back, but I didn't see a spike in fatal monitor, didn't realize there was a spike in mediawiki-error rate [22:30:27] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 and group2 wikis back to 1.31.0-wmf.29 [22:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:43] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:30:43] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:31:02] ok, we're back on wmf.29 for group1 and group2 [22:31:42] PROBLEM - Varnish frontend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:31:42] PROBLEM - Varnish backend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:31:58] this seems more like a prometheus failure than a varnish failure, but looking [22:32:38] (perhaps those should be UNKNOWN rather than CRITICAL?) [22:34:32] yeah, bast3002 seems in some kind of trouble (which is where prometheus goes through) [22:34:41] load average: 16.68 [22:36:02] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:36:02] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:36:03] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:36:03] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:36:04] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:36:04] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:36:05] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:36:05] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:36:06] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:36:12] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:12] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:12] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:36:12] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:36:13] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:36:14] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:36:14] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:36:15] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - Varnish frontend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - Varnish backend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:36:32] PROBLEM - PyBal BGP sessions are established on lvs3001 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:33] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:36:33] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:36:34] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:36:34] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:36:35] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:36:35] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:36:42] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:36:42] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:36:53] prometh+ 9406 1 99 Feb22 ? 
100-20:00:44 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/ops/metrics -web.listen-address 127.0.0.1:9900 -web.external-url http://prometheus/ops -storage.local.retention 2190h0m0s -config.file /srv/prometheus/ops/prometheus.yml -storage.local.chunk-encoding-version [22:36:59] 2 [22:37:13] ^ this bit of prometheus seems to be locking up on CPU% and causing huge iowait on disk [22:38:02] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish backend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:38:02] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:38:03] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:38:03] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:38:04] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:38:04] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:38:05] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:38:05] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:38:06] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:38:06] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:38:06] Krinkle: that failure
isn't related to the prior latency issue that i moved an index for, it's something else that also happened for 30 minutes overnite that i noticed in our dashboards. Not sure yet what it is. [22:38:07] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:38:12] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:38:12] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:38:13] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:38:23] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:38:24] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:38:24] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:38:25] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:38:25] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:38:32] RECOVERY - Varnish frontend child restarted on cp3047 is OK: 
(C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:38:32] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:12] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:43:12] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:43:13] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:43:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:43:14] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:43:14] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:43:15] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:15] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:16] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:43:21] ridiculous :P [22:43:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:43:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:43:22] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:43:23] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:43:24] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:43:24] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:43:25] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:43:42] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:43:43] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:43:43] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:43:44] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:43:44] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:43:45] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:44:53] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:45:02] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:45:03] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:45:04] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:45:04] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:45:05] RECOVERY - Varnish backend child 
restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:45:05] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:45:06] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:45:06] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:45:12] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:45:12] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:45:12] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:45:13] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish backend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:45:13] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:45:14] RECOVERY - Varnish frontend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:45:14] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:47:33] ebernhardson: Krinkle so it seems like whatever is causing https://phabricator.wikimedia.org/T192609 was in wmf.30: as soon as I rolled back, those warnings stopped. It's still happening on group0 wikis. Adding it as a train blocker, FYI.
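The PROBLEM/RECOVERY flood around this conversation all comes from Icinga checks that fetch a metric from the Prometheus HTTP API (prometheus.svc.esams.wmnet) with a 10-second read timeout and compare the value against the thresholds printed in each alert ("(C)3 gt (W)1 gt ..."). A minimal sketch of that pattern, assuming a requests-based client and a hypothetical PromQL query; this is not the actual WMF check script:

```
# Sketch of a Prometheus-backed Icinga check. The endpoint, the
# 10-second timeout, and the two error strings mirror the alert text
# in this log; the metric name below is a hypothetical stand-in.
import sys
import requests

PROMETHEUS = "http://prometheus.svc.esams.wmnet/ops/api/v1/query"

def check(query, warning=1.0, critical=3.0):
    try:
        resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
        result = resp.json()["data"]["result"]
    except ValueError as exc:
        # Connection succeeded but the body was not JSON (e.g. an empty
        # response from a daemon that is down or mid-restart).
        print("CRITICAL - %s error while decoding json: %s" % (PROMETHEUS, exc))
        return 2
    except requests.exceptions.RequestException as exc:
        # The failure mode flooding this channel: Prometheus itself is
        # unreachable, so every host the check covers goes CRITICAL at once.
        print("CRITICAL - %s timeout while fetching: %s" % (PROMETHEUS, exc))
        return 2
    value = float(result[0]["value"][1]) if result else 0.0
    if value > critical:
        print("CRITICAL - %s gt %s" % (value, critical))
        return 2
    if value > warning:
        print("WARNING - %s gt %s" % (value, warning))
        return 1
    print("OK - (C)%s gt (W)%s gt %s" % (critical, warning, value))
    return 0

if __name__ == "__main__":
    # Hypothetical query: Varnish child restarts over the last hour.
    sys.exit(check('increase(varnish_mgt_child_start[1h])'))
```

Note that one unhealthy Prometheus backend is enough to flip dozens of per-host checks at once, which is exactly the spam pattern in this window.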
[22:47:54] thx [22:48:19] thcipriani: btw, it seems fatal-monitor (on logstash) doesn't include type:mw/channel:error [22:48:29] That'll be needed as soon as those are no longer in type:hhvm [22:48:32] (separate from php7 migration) [22:48:52] mw itself has a setting to send to channel=error without also sending to php stderr (type:hhvm currently) [22:48:56] thcipriani: I have one suspicion, a patch that changed how we serialize jobs to work with the new job queue. looking into it [22:48:59] which is planned to be turned on, and is already on in beta [22:49:17] ebernhardson: thank you! [22:49:53] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144571 (10BBlack) [22:49:55] Krinkle: k, we'll need to update fatal monitor as well as the logstash_checker script scap uses to check canaries for error rate spikes. [22:49:59] Might want to restore https://gerrit.wikimedia.org/r/#/c/427759/ and merge/deploy if 1.29 is hanging around [22:50:02] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:50:02] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:50:03] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:50:04] .29 [22:50:12] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out.
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:50:12] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:50:13] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:50:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:50:14] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:50:14] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:50:15] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:50:15] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:50:16] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:50:16] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:50:21] uh, not that one [22:50:22] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:50:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:50:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:50:30] https://gerrit.wikimedia.org/r/#/c/427754/ [22:50:32] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish frontend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish backend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:50:32] PROBLEM - Varnish backend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:50:32] (03PS1) 10Thcipriani: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 [22:50:33] PROBLEM - Varnish frontend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:50:33] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:50:33] and there it goes again, trying a daemon restart [22:50:34] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.esams.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish backend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish frontend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:51:52] RECOVERY - Varnish frontend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:51:53] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [22:51:54] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:52:03] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:52:03] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:52:12] RECOVERY - Varnish backend child 
restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [22:52:13] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [22:52:13] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:52:14] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:52:14] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:52:15] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [22:52:15] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:16] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:16] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [22:52:17] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:52:22] RECOVERY - Varnish backend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [22:52:23] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:52:34] (03CR) 10Thcipriani: [C: 032] Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:52:42] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish backend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [22:52:42] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:52:43] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:52:43] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:52:45] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [22:52:45] RECOVERY - PyBal BGP sessions are established on lvs3001 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:52:45] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [22:53:47] (03Merged) 10jenkins-bot: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:55:03] (03PS1) 10Thcipriani: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 [22:55:42] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144593 (10BBlack) It did keep spamming by the time I got done writing the above. Attempting to stop it now, but the basic daemon "stop" operation via systemctl is taking quite a long time (over 3 minute... 
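On Krinkle's fatal-monitor point above: the dashboard (and the logstash_checker canary script) currently matches type:hhvm documents, and would additionally need to match type:mw AND channel:error once MediaWiki sends errors to that channel directly. A hedged sketch of that combined filter as an Elasticsearch bool query; the field names come from the conversation, and the real saved search may be structured differently:

```
# Hedged sketch of the combined fatal-monitor filter Krinkle describes.
# Field names (type, channel) are taken from the IRC conversation; the
# actual dashboard / logstash_checker query definitions may differ.
fatal_monitor_query = {
    "query": {
        "bool": {
            "should": [
                # current source of fatals/errors (HHVM stderr)
                {"term": {"type": "hhvm"}},
                # additionally needed once MW logs errors directly
                {"bool": {"must": [
                    {"term": {"type": "mw"}},
                    {"term": {"channel": "error"}},
                ]}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```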
[22:57:40] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144601 (10BBlack) And that was followed by this, by the time it finally stopped itself ~5 minutes later: ``` Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Don... [22:59:14] (03CR) 10jenkins-bot: Revert "All wikis to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427826 (owner: 10Thcipriani) [22:59:31] (03CR) 10Thcipriani: [C: 032] Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [22:59:33] PROBLEM - Varnish backend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [22:59:42] PROBLEM - Varnish frontend child restarted on cp3046 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops [22:59:43] PROBLEM - Varnish frontend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [22:59:52] PROBLEM - Varnish frontend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3043 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3047 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: 
Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish frontend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [23:00:02] PROBLEM - Varnish backend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [23:00:03] PROBLEM - Varnish frontend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [23:00:03] PROBLEM - Varnish frontend child restarted on cp3031 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180419T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
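The alert text has changed here from read timeouts to "error while decoding json: Expecting value: line 1 column 1 (char 0)": the checks now get a connection but an empty (or otherwise non-JSON) body back, consistent with the prometheus@ops daemon on bast3002 being down or mid-restart while the checks keep polling. That message is simply Python's json parser complaining about empty input:

```
import json

# Reproduce the exact error string from the alerts above by parsing an
# empty response body.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print("error while decoding json: %s" % exc)
    # -> error while decoding json: Expecting value: line 1 column 1 (char 0)
```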
[23:00:12] PROBLEM - Varnish backend child restarted on cp3008 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops [23:00:12] PROBLEM - Varnish backend child restarted on cp3039 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3032 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish frontend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3037 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish backend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [23:00:13] PROBLEM - Varnish frontend child restarted on cp3038 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops [23:00:14] PROBLEM - Varnish frontend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [23:00:14] PROBLEM - Varnish backend child restarted on cp3036 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops [23:00:15] PROBLEM - Varnish frontend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:00:15] PROBLEM - PyBal BGP sessions are established on lvs3004 is CRITICAL: 
http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:16] PROBLEM - PyBal BGP sessions are established on lvs3002 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:22] PROBLEM - Varnish backend child restarted on cp3007 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:00:22] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish backend child restarted on cp3035 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3040 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish backend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [23:00:23] PROBLEM - Varnish frontend child restarted on cp3045 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [23:00:24] PROBLEM - Varnish frontend child restarted on cp3041 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops [23:00:24] PROBLEM - Varnish backend child restarted on cp3049 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:00:42] PROBLEM - Varnish frontend child restarted on cp3044 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3010 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - PyBal BGP sessions are established on lvs3001 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:43] PROBLEM - Varnish backend child restarted on cp3042 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3033 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - Varnish backend child restarted on cp3030 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops [23:00:43] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:00:45] (03Merged) 10jenkins-bot: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [23:00:52] PROBLEM - Varnish backend child restarted on cp3034 is CRITICAL: http://prometheus.svc.esams.wmnet/ops/api/v1/query error while decoding json: Expecting value: line 1 column 1 (char 0) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops [23:02:15] (03PS2) 10Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) [23:04:51] !log thcipriani@tin Synchronized php: complete group1 and group2 wikis back to 1.31.0-wmf.29 (duration: 01m 16s) [23:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:57] (03CR) 10jenkins-bot: Revert "Group1 to 1.31.0-wmf.30" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427828 (owner: 10Thcipriani) [23:08:46] 10Operations, 10monitoring: 
prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144640 (10BBlack) It seems to be having problems coming up cleanly too, so more spam. First chunk of startup logs: ``` Apr 19 22:58:02 bast3002 systemd[1]: Starting prometheus server (instance ops)...... [23:09:30] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144641 (10BBlack) [also, I've downtimed all the esams-specific prometheus-based alerts in icinga for 24h now (varnish child-counting checks and pybal bgp checks)] [23:09:41] hopefully no more spam. worst case eventually one more wave of RECOVERY [23:10:43] RECOVERY - PyBal BGP sessions are established on lvs3001 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:10:52] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:10:53] RECOVERY - Varnish frontend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:11:02] RECOVERY - Varnish backend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:11:12] RECOVERY - Varnish frontend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops [23:11:13] RECOVERY - Varnish frontend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:11:17] it's always the worst case :P [23:11:22] RECOVERY - PyBal BGP sessions are established on lvs3004 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:11:22] RECOVERY - PyBal BGP sessions are established on lvs3002 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams%2520prometheus%252Fops [23:11:23] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops [23:11:24] RECOVERY - Varnish backend child restarted on cp3007 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3007&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish frontend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish frontend child restarted on cp3043 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams+prometheus/ops [23:11:32] RECOVERY - Varnish backend child restarted on cp3049 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3049&var-datasource=esams+prometheus/ops [23:11:32] 
[23:11:32] RECOVERY - Varnish frontend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops
[23:11:32] RECOVERY - Varnish frontend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops
[23:11:33] RECOVERY - Varnish frontend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish backend child restarted on cp3031 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3031&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish backend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops
[23:11:43] RECOVERY - Varnish frontend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops
[23:11:45] bblack: well you jinxed it by mentioning it
[23:11:47] in any case, these are all downtimed for 24h now, they're just re-alerting recovery because the last time they failed was before the downtime was set
[23:11:52] RECOVERY - Varnish backend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish backend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish backend child restarted on cp3030 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish frontend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops
[23:11:52] RECOVERY - Varnish frontend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops
[23:11:53] RECOVERY - Varnish frontend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops
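Reading the recovery output: "(C)3 gt (W)1 gt 1" appears to mean the check goes CRITICAL when the measured value exceeds 3 and WARNING when it exceeds 1, and the current value is 1, hence OK; the PyBal checks reporting "OK: NaN" is consistent with the freshly restarted Prometheus having no samples yet for the query. A small sketch of that assumed semantics (not the plugin's actual code):

```python
# Illustrative reading of the alert output above (assumed semantics,
# not the actual plugin code): "(C)3 gt (W)1 gt 1" means the check is
# CRITICAL when value > 3, WARNING when value > 1, and the current
# value is 1, hence OK. A NaN value compares False against both
# thresholds, so a just-restarted datasource with no samples also
# falls through to OK.
def check(value, warn=1, crit=3):
    if value > crit:
        return "CRITICAL - (C)%s gt (W)%s gt %s" % (crit, warn, value)
    if value > warn:
        return "WARNING - (C)%s gt (W)%s gt %s" % (crit, warn, value)
    return "OK: (C)%s gt (W)%s gt %s" % (crit, warn, value)


print(check(1))             # OK: (C)3 gt (W)1 gt 1
print(check(4))             # CRITICAL - (C)3 gt (W)1 gt 4
print(check(float("nan")))  # NaN > x is False, so this also reports OK
```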
[23:11:53] RECOVERY - Varnish backend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops
[23:12:02] RECOVERY - Varnish backend child restarted on cp3044 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3044&var-datasource=esams+prometheus/ops
[23:12:02] RECOVERY - Varnish frontend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops
[23:12:03] RECOVERY - Varnish backend child restarted on cp3047 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3047&var-datasource=esams+prometheus/ops
[23:12:03] RECOVERY - Varnish backend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops
[23:12:12] RECOVERY - Varnish frontend child restarted on cp3042 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3042&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3034 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3034&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish backend child restarted on cp3039 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3039&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish backend child restarted on cp3008 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3008&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops
[23:12:13] RECOVERY - Varnish frontend child restarted on cp3038 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams+prometheus/ops
[23:12:14] RECOVERY - Varnish backend child restarted on cp3037 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3037&var-datasource=esams+prometheus/ops
[23:12:14] RECOVERY - Varnish backend child restarted on cp3032 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams+prometheus/ops
[23:12:22] RECOVERY - Varnish backend child restarted on cp3036 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3036&var-datasource=esams+prometheus/ops
[23:12:22] RECOVERY - Varnish backend child restarted on cp3040 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams+prometheus/ops
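Each of these alerts links to the same Grafana dashboard, parameterized per host through the var-server and var-datasource template variables in the URL. A sketch of composing such links with a hypothetical helper (not the actual Icinga/puppet configuration; minor URL-encoding details differ from the links shown above):

```python
from urllib.parse import urlencode

# Hypothetical helper, not the actual puppet/icinga config: compose a
# per-host dashboard link from one shared dashboard by filling in the
# Grafana template variables seen in the alert URLs above.
BASE = "https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats"


def notes_url(server, site="esams", panel=66):
    params = [
        ("panelId", panel),
        ("fullscreen", ""),  # rendered as a bare flag in the real links
        ("orgId", 1),
        ("var-server", server),
        ("var-datasource", "%s prometheus/ops" % site),
    ]
    return BASE + "?" + urlencode(params)


print(notes_url("cp3040"))
```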
[23:12:32] RECOVERY - Varnish backend child restarted on cp3045 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3045&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish backend child restarted on cp3035 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3035&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish backend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops
[23:12:32] RECOVERY - Varnish frontend child restarted on cp3010 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3010&var-datasource=esams+prometheus/ops
[23:12:33] RECOVERY - Varnish backend child restarted on cp3046 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3046&var-datasource=esams+prometheus/ops
[23:12:33] RECOVERY - Varnish frontend child restarted on cp3041 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3041&var-datasource=esams+prometheus/ops
[23:13:09] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4144648 (10BBlack) Crash recovery appears to have completed at about 23:10:33 and things came back online. We'll see if it remains stable. Leaving the downtimes in place to avoid more spamming of IRC.
[23:13:30] !log ebernhardson@tin Synchronized php-1.31.0-wmf.30/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 17s)
[23:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:36] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[23:16:36] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue/caching issues - https://phabricator.wikimedia.org/T192473#4144668 (10EddieGP) p:05Low>03High Indeed, the jobqueue on beta is still broken, although...
[23:16:56] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4144670 (10demon) That was part of that commit. I was kinda following the example set by the conftool package. If this is problemati...
[23:16:56] !log ebernhardson@tin Synchronized php-1.31.0-wmf.29/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 18s)
[23:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:31] we have over 300 wikis now
[23:27:40] lfn.wp was 300 in my count
[23:27:48] wikipedias i mean, sorry
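On the closing exchange about the 300th Wikipedia: counts like this are usually taken from the dblist files in operations/mediawiki-config, which list one wiki database per line. A minimal sketch, assuming a local checkout of that repository (lfnwiki being the database for lfn.wikipedia.org):

```python
#!/usr/bin/env python3
# Minimal sketch, assuming a local checkout of operations/mediawiki-config,
# where dblists/wikipedia.dblist lists one Wikipedia database per line.
with open("dblists/wikipedia.dblist") as f:
    wikis = [line.strip() for line in f
             if line.strip() and not line.startswith("#")]

print("%d wikipedias; lfnwiki present: %s" % (len(wikis), "lfnwiki" in wikis))
```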