[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T0000). [00:03:53] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.4 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:10:57] SMalyshev: a little while ago the number of threads on wdqs just kept going up a lot [00:11:08] so no, i dont think network problems [00:11:48] queries per second did not go up though [00:12:23] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.22 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:14:22] SMalyshev: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&from=1538603824853&to=1538611796920&panelId=22&fullscreen&var-cluster_name=wdqs&refresh=1m [00:27:47] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Klein) It worked! Thank you all! :) [00:41:39] (03PS4) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:42:17] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:50:00] (03PS5) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:59:20] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (10faidon) Note that we bought OpenGear adapters for the all the Atlases across all sites (incl. ulsfo) last year and shipped them to eqiad: T166715#3308801 [00:59:56] (03PS6) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:02:03] (03CR) 10Dzahn: "converted upload_rewrite to a struct." 
[puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:16:03] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.003 second response time [01:24:15] (03PS7) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:24:17] (03PS4) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:25:12] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:25:25] (03CR) 10Dzahn: "PS4: upload_rewrite is now a struct instead of a string" [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:45] (03PS8) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:47] (03PS5) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:49] (03PS3) 10Dzahn: mediawiki::web::prod_sites: convert donate.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:32:54] (03PS9) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:36:13] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:37:49] (03PS6) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:38:34] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:40:43] (03PS7) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:41:23] PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:41:26] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:47:40] (03PS8) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:49:22] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:54:02] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:58:43] 
PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [02:07:52] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.001 second response time [02:09:42] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [02:10:43] RECOVERY - WDQS HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.366 second response time [02:10:53] PROBLEM - High lag on wdqs2001 is CRITICAL: 1.076e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:12:12] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.533 second response time [02:13:13] RECOVERY - WDQS HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.097 second response time [02:13:43] PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [02:14:12] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.096e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:14:42] PROBLEM - High lag on wdqs2002 is CRITICAL: 1.098e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:14:52] RECOVERY - WDQS HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.169 second response time [02:14:56] RECOVERY - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.093 second response time [02:43:04] !log depooled wdqs2001 to see if it catches up faster [02:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:10:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:16:14] (03PS8) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:16:47] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [03:17:43] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 289 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:21:05] !log repooled wdqs2001 [03:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:43] !log depool wdqs2003 to let it catch up [03:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:11] (03PS9) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:30:09] (03PS10) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:34:11] (03CR) 10Mathew.onipe: "Jenkins dry run:" [puppet] - 
10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [03:35:23] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 780 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:01:52] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 694 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:02:02] PROBLEM - pdfrender on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:40] (03CR) 10Smalyshev: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [04:04:39] !log repooled wdqs2003 [04:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:22] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time [04:21:20] (03PS11) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [05:25:33] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) [05:27:31] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:29:13] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:30:24] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2062 (duration: 00m 57s) [05:30:25] !log Deploy schema change on db2062 - T205913 [05:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:36] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [05:30:54] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 [05:32:41] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:33:29] (03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:34:21] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:35:27] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2062 (duration: 00m 56s) [05:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:20] !log Deploy schema change on db2048 (s1 master) - T205913 [05:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:24] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [05:37:33] (03CR) 10Alexandros Kosiaris: [C: 031] sre.switchdc.mediawiki: remove the restart parsoid step, now useless [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [05:48:21] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:53:18] (03CR) 10Giuseppe Lavagetto: [C: 032] sre.switchdc.mediawiki: remove the restart parsoid step, now useless [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [06:04:39] (03PS2) 10Jcrespo: mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) [06:12:14] (03CR) 10Jcrespo: [C: 031] "Compression finished:" [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:13:03] (03CR) 10Marostegui: [C: 031] mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:17:12] (03CR) 10Jcrespo: [C: 032] mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:19:11] (03PS4) 10Giuseppe Lavagetto: service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 [06:21:45] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Aklapper) 05stalled>03Open [06:22:25] (03CR) 10Krinkle: [C: 031] Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [06:24:53] !log create manual backup of databases on eqiad s6, s7, s8, x1 [06:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:13] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:25:54] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05Open>03stalled p:05Triage>03Normal Stalled as the server hasn't been received yet [06:26:17] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:27:58] (03PS1) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:29:13] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:29:22] (03PS2) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:29:32] (03PS3) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:31:52] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10Marostegui) The following hosts (aside from the ones above) will need to be downtimed too: db1117, db2042 and db2078 (they replicate from db1072 and db1073) db2037 (replicates from d... 
[06:33:37] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 (owner: 10Giuseppe Lavagetto) [06:34:53] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:38:02] (03PS1) 10KartikMistry: [WIP] apertium-apy: Set locale to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/464482 [06:38:30] (03PS2) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) [06:40:57] (03CR) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:41:09] (03PS1) 10Alexandros Kosiaris: mathoid: Add nomial resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [06:41:52] (03PS2) 10Alexandros Kosiaris: mathoid: Add nominal resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [06:44:46] (03PS2) 10Giuseppe Lavagetto: parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 [06:51:07] !log reenabling consistency configuration on s5 replica databases [06:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:07] (03PS4) 10Zoranzoki21: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) [06:53:23] (03PS5) 10Zoranzoki21: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) [06:57:46] !log starting multisource replication of s3 from s5 at eqiad master [06:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:24] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:10:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:14:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:23:13] PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:23:13] PROBLEM - swift-account-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:23:14] PROBLEM - swift-object-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:23:23] PROBLEM - swift-container-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:23:24] PROBLEM - swift-container-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:23:24] PROBLEM - swift-container-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:23:43] PROBLEM - swift-account-reaper on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:23:44] PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:23:50] that's me [07:23:50] <_joe_> is someone working on ms-be1041? [07:23:53] <_joe_> ok [07:23:53] PROBLEM - swift-container-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:23:54] <_joe_> :P [07:23:54] PROBLEM - swift-account-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:24:03] PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:24:04] dammit I thought I silenced it [07:24:04] PROBLEM - swift-account-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:24:07] sorry about the spam [07:25:40] !log reformat ms-be1041 with crc=1 finobt=0 - T199198 [07:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:45] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:28:43] RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:28:43] RECOVERY - swift-account-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:28:44] RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:28:54] RECOVERY - swift-container-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:29:03] RECOVERY - swift-container-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:29:03] RECOVERY - swift-container-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:29:14] RECOVERY - swift-account-reaper on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-reaper [07:29:14] RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:29:23] RECOVERY - swift-container-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:29:24] RECOVERY - swift-account-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:29:25] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga: reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts - https://phabricator.wikimedia.org/T206187 (10Mathew.onipe) [07:29:34] RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:29:43] RECOVERY - swift-account-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:31:17] !log move Piwik/Matomo from bohrium to matomo1001 - T202962 [07:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:23] T202962: Upgrade bohrium (piwik/matomo) to Debian Stretch - https://phabricator.wikimedia.org/T202962 [07:35:33] (03PS3) 10Elukey: role::cache::text: add a backend for matomo1001 [puppet] - 10https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962) [07:36:14] (03CR) 10Elukey: [C: 032] role::cache::text: add a backend for matomo1001 [puppet] - 10https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [07:40:50] (03PS1) 10Zoranzoki21: Edited syntax of the code where the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:40:52] (03CR) 10Mathew.onipe: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [07:41:46] (03PS2) 10Zoranzoki21: Edited syntax of the code where the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:42:06] (03PS3) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:43:21] (03PS3) 10Elukey: Replace bohrium with matomo1001 in cache text configuration [puppet] - 10https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962) [07:44:07] (03CR) 10Mathew.onipe: "> Patch Set 11: Verified+2" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [07:44:27] (03CR) 10Muehlenhoff: [C: 04-1] "Some comments inline" (033 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [07:49:42] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Marostegui) [07:52:15] (03CR) 10KartikMistry: "kartik@scb2001:~$ locale -a | grep UTF-8" [puppet] - 10https://gerrit.wikimedia.org/r/464482 (owner: 10KartikMistry) [07:57:19] (03CR) 10Elukey: [C: 032] Replace bohrium with matomo1001 in cache text configuration [puppet] - 10https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [08:00:37] !log re-enabling puppet on maps1004 [08:00:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:33] (03CR) 10Jcrespo: "> but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:09:16] !log Restart icinga T196336 [08:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:21] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [08:12:32] (03CR) 10Banyek: "> > but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:13:32] (03CR) 10Marostegui: "> > but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:13:59] (03CR) 10Muehlenhoff: [C: 04-1] "Perfect, then we can use dh_sysuser instead, which reduces this to a very small change, see https://manpages.debian.org/stretch/dh-sysuser" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:14:01] ACKNOWLEDGEMENT - ElasticSearch shard size check on search.svc.codfw.wmnet is CRITICAL: CRITICAL - cebwiki_content_1521724408(51gb) Mathew.onipe This is mostly caused by segment merges - T206187 - The acknowledgement expires at: 2018-10-05 20:10:28. [08:15:38] ACKNOWLEDGEMENT - ElasticSearch shard size check on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1537913194(61gb) Mathew.onipe This is mostly caused by segment merges - T206187 - The acknowledgement expires at: 2018-10-05 20:10:12. [08:20:55] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) Apparently @Mattflaschen-WMF is no more in charge, who is in charge of flow maintenance now, maybe #gro... [08:34:00] !log installing ca-certificates updates for jessie/stretch [08:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:09] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ca-certificates] [08:45:13] (03PS1) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:46:42] (03PS2) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:49:46] (03PS2) 10Volans: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) [08:52:18] !log installing python2.7/python3.4/python3.5 security updates on jessie/stretch [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] (03CR) 10Volans: [C: 031] "LGTM, don't forget to commit the real one in the private repo before merging the puppet change" [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [08:52:56] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:54:22] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:54:28] (03PS3) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:55:42] (03CR) 10Volans: [C: 04-1] "I think we need another notify too, see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [08:58:03] (03CR) 10Gehel: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [08:59:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12763/" [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [08:59:40] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only makes sense in active control node [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) [09:00:12] (03PS2) 10Elukey: Clean up bohrium's references in cache text [puppet] - 10https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962) [09:03:00] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Volans) As a reminder be careful when adding and merging fleet-wide checks, I'm not sure how many more we can add without increasing too much Icinga load as 1 fleet wide check => 1300 ch... 
[09:05:10] (03CR) 10Volans: "FYI I've updated the Switch Datacenter wiki page that was left behind after this change ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [09:05:52] (03CR) 10Elukey: [C: 032] Clean up bohrium's references in cache text [puppet] - 10https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [09:06:29] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:07:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:07:51] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only makes sense in active control node [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) [09:12:06] (03CR) 10Gehel: [C: 04-1] "see comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [09:13:21] !log T203177 schedule 8h icinga downtime for cloudcontrol1003,1004 and labmon1001 [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:26] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [09:14:23] (03CR) 10Filippo Giunchedi: "FWIW the collected metrics are not going to be duplicated in the sense that they would have different "instance" tags for each cloudcontro" [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:17:47] arturo: FYI ^ [09:18:57] godog: something from wikibugs? I ignore it [09:19:10] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 43.56 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:20:06] oh a gerrit comment, reading [09:21:12] (03PS1) 10Muehlenhoff: Add library hints for Pythons [puppet] - 10https://gerrit.wikimedia.org/r/464494 [09:21:31] arturo: yeah that one [09:22:29] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:22:30] godog: what benefit do you see in storing both metrics? 
[09:24:00] (03CR) 10Muehlenhoff: [C: 032] Add library hints for Pythons [puppet] - 10https://gerrit.wikimedia.org/r/464494 (owner: 10Muehlenhoff) [09:24:05] making sure metrics collection works all the time, regardless of active/passive mostly [09:24:27] so then when you switch you have to think about one less thing and actually see the switch happening [09:25:13] makes sense [09:26:08] on the other hand, the passive node is not even that, is just a cold spare which is not expected to go into service unless we have serious issues [09:27:10] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: metrics: cleanup unused hiera datafile [puppet] - 10https://gerrit.wikimedia.org/r/464495 (https://phabricator.wikimedia.org/T203177) [09:28:38] yeah, my point being that when you actually put it in service you already know you have metrics [09:30:07] godog: fair enough, I will revert and think on doing some filters in the grafana :-) thanks [09:30:48] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/464495 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:32:31] arturo: sounds good! you can either add an "instance" templating variable or depending on the metrics you can query for values > 0 and that should do the right thing [09:32:49] or display all instances with {{instance}} in the legend template [09:33:36] godog: ok, will investigate and ask for your help in don't manage to do it myself :-P [09:35:03] (03PS1) 10Arturo Borrero Gonzalez: Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) [09:35:20] godog: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464496 +1 welcome :-) [09:35:40] (03PS5) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [09:37:11] (03CR) 10Filippo Giunchedi: [C: 031] Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:37:13] yup, lgtm! [09:37:20] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:37:30] (03PS2) 10Arturo Borrero Gonzalez: Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) [09:37:38] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) >>! In T206114#4641114, @Volans wrote: > As a reminder be careful when adding and merging fleet-wide checks, I'm not sure how many more we can add without increasing too much Icin... [09:41:01] (03CR) 10Volans: [C: 04-1] "Generally looks ok, small nits around and a couple of questions. 
Also I've just skimmed the tests as I'm totally not familiar with them an" (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [09:43:13] (03CR) 10Volans: Netbox, set the napalm_username variable and matching keyholder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [09:46:58] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: metrics: adjust depedency on novaenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [09:47:47] (03CR) 10Jcrespo: "Genuine question I don't know, see below" (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [09:47:49] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: metrics: adjust depedency on novaenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:48:58] (03CR) 10Jcrespo: "Another question, sorry for my ignorance." (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [09:52:38] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam) @DarTar could you please check this, and if ok, approve it? Thanks! [09:52:39] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) [09:53:18] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:57:26] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) [10:02:04] (03PS13) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:02:30] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6294.69 seconds [10:04:14] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:05:38] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:06:31] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [10:06:34] (03PS1) 10Elukey: Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) [10:08:14] !log rolling reboot of ms-fe hosts in eqiad for kernel security update [10:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:13] (03PS1) 10Alexandros Kosiaris: mathoid: Switch liveness probe into tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464504 [10:10:42] (03PS1) 10Alexandros Kosiaris: Set the scaffolding's livenessProbe to tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464505 [10:11:52] the dbstore1001 lag may be the backups [10:12:01] I will disable those alerts and 
setup a comment [10:12:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] role::configcluster_stretch: enable etcd replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:12:51] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:13:42] (03CR) 10Elukey: role::configcluster_stretch: enable etcd replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:14:00] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [10:16:17] (03PS4) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:18:29] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:18:29] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 46.88 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:20:44] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:25:52] (03CR) 10Vgutierrez: Detect when cert config changes and re-issue (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:25:57] (03PS14) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:26:56] (03CR) 10Filippo Giunchedi: [C: 031] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:28:29] (03CR) 10Filippo Giunchedi: [C: 031] "Nit inline, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:28:42] (03CR) 10Banyek: wmf-pt-kill: WMF patched version 2 (032 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:29:37] (03PS1) 10Giuseppe Lavagetto: Repackaging for stretch [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/464507 (https://phabricator.wikimedia.org/T205814) [10:30:40] (03CR) 10jerkins-bot: [V: 04-1] Repackaging for stretch [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/464507 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [10:31:29] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, added a few WMCS folks for confirmation" [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:32:54] (03PS5) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:33:45] (03CR) 10Giuseppe Lavagetto: [C: 031] "+1 but wait for the etcd-mirror package to be available." 
[puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:33:48] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, and 2 others: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10mmodell) [10:35:48] (03CR) 10Muehlenhoff: [C: 04-1] wmf-pt-kill: WMF patched version 2 (032 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:36:32] <_joe_> !log uploading etcd-mirror to stretch-wikimedia T205814 [10:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:37] T205814: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 [10:36:58] (03PS6) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:37:59] (03CR) 10Alex Monk: [C: 04-1] "also see PS36 comment" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [10:38:08] !log upload scap 3.8.7-1 - T204383 [10:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:12] T204383: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 [10:38:56] (03PS4) 10Filippo Giunchedi: Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (https://phabricator.wikimedia.org/T204383) (owner: 1020after4) [10:39:04] (03PS6) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [10:39:39] (03CR) 10Filippo Giunchedi: [C: 032] Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (https://phabricator.wikimedia.org/T204383) (owner: 1020after4) [10:41:59] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.92 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:43:22] I've silenced that alert for ulsfo, depooled [10:44:17] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.23/README: noop sync to verify that scap 3.8.7-1 works (at least on a basic level) (duration: 00m 59s) [10:44:18] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10fgiunchedi) [10:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:21] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, and 2 others: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi All done! 3.8.7-1 is live [10:44:56] Thanks for uploading the new version godog! [10:45:07] np twentyafterfour [10:47:33] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:53:05] (03PS1) 10Sbisson: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:43] Present [11:02:47] zeljkof, will you swat? [11:03:54] o/ [11:03:58] I can SWAT today [11:04:02] Urbanecm: yes! :D [11:04:30] :D [11:05:51] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:07:43] (03Merged) 10jenkins-bot: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:08:08] (03CR) 10jenkins-bot: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:08:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:08:53] Urbanecm: 463582 is at mwdebug2001 [11:10:04] zeljkof, are you sure it is on mwdebug2001? [11:10:19] Ah, sorry [11:10:40] Yeah, it is working, was checking in wrong way [11:10:43] zeljkof ^^ [11:10:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:11:48] (03PS39) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [11:16:18] Urbanecm: sorry, got distracted, deploying [11:16:31] ok [11:17:31] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463582|Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org (T205747)]] (duration: 00m 57s) [11:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:36] T205747: Create new Photowalk namespace for bd.wikimedia.org - https://phabricator.wikimedia.org/T205747 [11:17:40] Urbanecm: deployed ^ [11:17:56] thank you. Can you run namespaceDupes.php to be sure there's nothing inaccessible? 
[11:18:32] zeljkof, ^ [11:18:33] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:18:41] Urbanecm: sure [11:18:45] thank you [11:22:15] Urbanecm: done T205747#4641406 [11:22:28] thx [11:22:48] (03CR) 10Zfilipin: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:22:55] (03PS6) 10Zfilipin: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:23:12] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:25:17] (03Merged) 10jenkins-bot: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:25:50] Urbanecm: 463584 is at mwdebug [11:25:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:26:04] testing [11:26:28] (03PS2) 10Zfilipin: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:26:37] zeljkof, working, please deploy [11:26:44] ok [11:27:38] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463584|Change acewiki default time zone to Asia/Jakarta (T205693)]] (duration: 00m 56s) [11:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:43] T205693: Change acewiki default time zone to Asia/Jakarta - https://phabricator.wikimedia.org/T205693 [11:27:50] Urbanecm: deployed ^ [11:27:55] thx [11:28:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:30:19] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:30:35] (03Merged) 10jenkins-bot: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:32:20] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:32:38] Urbanecm: 463227 at mwdebug2001 [11:33:39] zeljkof, working, please deploy (and run namespaceDupes.php afterwards).- [11:34:43] ok [11:35:49] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463227|Add some namespaces aliases for zhwikiversity (T201675)]] (duration: 00m 57s) [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:54] T201675: Create new namespaces in zhwikiversity - https://phabricator.wikimedia.org/T201675 [11:35:58] Urbanecm: deployed ^ [11:36:35] thank you. 
[11:38:03] hmm, 3 links to fix, 2 were resolvable. [11:38:09] Noting, will investigate later [11:38:19] Urbanecm: yeah, one problem [11:38:35] (03PS2) 10Zfilipin: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:38:50] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:39:05] (03CR) 10jenkins-bot: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:39:07] (03CR) 10jenkins-bot: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:39:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:40:25] zeljkof, please push wgCopyUploadsDomains patches directly to prod, nothing to test for me. Thank you! [11:40:41] (03Merged) 10jenkins-bot: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:40:42] Urbanecm: ok [11:41:46] Urbanecm: merge conflict for 457474 [11:41:57] (not resolvable in gerrit) [11:42:01] will fix zeljkof [11:42:13] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457469|Add .bollywoodhungama.in to wgCopyUploadsDomains (T203363)]] (duration: 00m 57s) [11:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] T203363: Please add http://www.bollywoodhungama.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T203363 [11:42:56] Urbanecm: 457469 deployed ^ [11:43:02] thx [11:45:31] Urbanecm: merge conflict also for 460700 [11:45:38] fixing both [11:47:16] (03PS3) 10Urbanecm: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) [11:47:20] ^^ zeljkof ^^ [11:47:33] Urbanecm: on it [11:47:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:48:02] (03PS2) 10Urbanecm: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) [11:48:07] and the second one ^^ [11:48:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:50:07] (03Merged) 10jenkins-bot: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:50:35] Urbanecm: the last two patches should be deployed without mwdebug? 
[11:50:39] yes [11:51:36] Urbanecm: still conflict for 460700 [11:51:41] (the last one) [11:51:53] ok, probably the previous patch caused another conflict, fixing [11:52:02] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457474|add Radlines.org to $wgCopyUploadsDomains (T203219)]] (duration: 00m 57s) [11:52:05] yes, the last 3 patches seem to touch the same line [11:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:07] T203219: Please add Radlines.org to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T203219 [11:52:23] Urbanecm: 457474 deployed ^ [11:52:40] (03PS3) 10Urbanecm: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) [11:52:42] fixed ^^ [11:53:43] (03CR) 10jenkins-bot: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:53:45] (03CR) 10jenkins-bot: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:53:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [11:55:56] (03Merged) 10jenkins-bot: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [11:57:20] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460700|Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons (T203371)]] (duration: 00m 56s) [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:26] T203371: Please add nasimonline.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T203371 [11:57:29] Urbanecm: all deployed! [11:57:33] thank you! [11:57:40] !log EU SWAT finished [11:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:25] well, do you have time for one additional patch zeljkof? 
:D MW train is not deployed in its EU version, so I hope it would be possible :) [11:58:31] it's https://gerrit.wikimedia.org/r/464481 [11:59:06] (03PS4) 10Zfilipin: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [11:59:15] Urbanecm: sure [11:59:18] thank you [11:59:59] Urbanecm: just please update the calendar [12:00:02] will do, thanks [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1200) [12:00:07] !log one more patch for EU SWAT [12:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:11] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:01:23] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [12:01:52] !log rolling reboot of ms-fe hosts in codfw for kernel security update [12:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:07] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:03:00] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [12:03:09] (03Merged) 10jenkins-bot: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:03:29] are there any differences between the eqiad and codfw mw redis lock servers? [12:04:11] Urbanecm: 464481 at mwdebug [12:04:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:04:43] zeljkof, working, please deploy [12:06:19] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:464481|Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki (T205595)]] (duration: 00m 57s) [12:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:23] T205595: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki - https://phabricator.wikimedia.org/T205595 [12:06:41] Urbanecm: all deployed, please check and thanks for deploying with #releng! ;) [12:06:47] !log EU SWAT finished [12:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] thank you zeljkof for deploying my and Zoranzoki21's patches!
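For readers unfamiliar with the SWAT jargon above: once jenkins-bot merges a mediawiki-config change, it is pulled onto the deployment host, staged on an mwdebug host for the requester to verify, and then synced to the fleet. A rough sketch, assuming the usual scap workflow; hostnames and paths are the conventional ones, not copied from this log:

    # on deploy1001: bring the staging copy up to date with the merged change
    cd /srv/mediawiki-staging && git pull
    # on an mwdebug host: pull the staged code so the requester can test via X-Wikimedia-Debug
    scap pull
    # back on deploy1001, once the requester confirms ("working, please deploy"):
    scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:464481|...]] (T205595)'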
[12:07:57] Urbanecm: no problemo :D [12:09:13] (03CR) 10jenkins-bot: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [12:09:15] (03CR) 10jenkins-bot: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:14:49] (03CR) 10GTirloni: [C: 032] openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [12:18:34] (03PS7) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [12:22:49] (03PS1) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [12:23:00] (03CR) 10Elukey: [C: 032] role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:23:56] !log deploy etcdmirror on conf1005 - T205814 [12:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:01] T205814: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 [12:24:40] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:24:48] (03CR) 10Volans: [C: 031] "LGTM, great job Matt! One documentation nitpick inline, but feel free to merge as is." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:26:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:35:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:39:47] (03CR) 10Alexandros Kosiaris: [C: 031] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [12:45:07] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:46:09] (03PS12) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [12:49:45] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:49:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:50:38] (03CR) 10Mathew.onipe: "Jenkins dry run: https://puppet-compiler.wmflabs.org/compiler1002/12770/wdqs1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:52:18] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) My current understanding of the issue: All IRQs from NIC are handled by a single CPU. Under load, Blazegraph satur... [12:52:38] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) a:03Gehel [12:57:22] (03CR) 10Gehel: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:59:35] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1300) [13:02:17] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Scap, and 2 others: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10mobrovac) >>! In T205981#4634221, @Gehel wrote: > I can confirm that @Mathew.onipe needs to be able to deploy wikidata query serv... [13:02:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Add cumin aliases for each wdqs clusters - https://phabricator.wikimedia.org/T205542 (10Mathew.onipe) 05Open>03Resolved [13:03:38] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) [13:03:45] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[chown /etc/eventstreams/config.yaml],Package[electron-render/deploy],Exec[chown /srv/deployment/electron-render for deploy-service] [13:04:27] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [13:07:05] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:08:09] (03CR) 10Gehel: [C: 04-1] "A few minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:08:14] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:09:16] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 9.002 ge 4 Muehlenhoff T200678 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [13:09:30] (03PS4) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [13:11:32] (03PS2) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:11:47] (03PS2) 10Ottomata: Install python 2 variant of sklearn on stat machines [puppet] - 10https://gerrit.wikimedia.org/r/464425 (owner: 10Gilles) [13:12:03] (03CR) 10Ottomata: [V: 032 C: 032] Install python 2 variant of sklearn on stat machines [puppet] - 10https://gerrit.wikimedia.org/r/464425 (owner: 10Gilles) [13:14:05] !log muting alerts on dbstore2002 and resuming compression of s2 database tables (T204930) [13:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:12] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [13:14:26] !log muting alerts on s2replication @dbstore2002 and resuming compression of s2 database tables (T204930) [13:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:48] (03PS3) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:15:08] (03PS4) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:19:42] 10Operations, 10Analytics, 10hardware-requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10elukey) p:05Triage>03Normal [13:20:12] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10elukey) [13:21:21] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Ottomata) I think `analytics-admins` is the right group let's keep using it! 
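T206217 above boils down to a sudo policy: members of analytics-admins get permission to restart two systemd services. Once that is in place, day-to-day use would look roughly like the following; the unit names are assumptions, not verified against the actual hosts:

    # restart the UIs after a config change or a hang (unit names assumed)
    sudo systemctl restart turnilo.service
    sudo systemctl restart superset.service
    # confirm they came back cleanly
    systemctl status turnilo.service superset.service --no-pager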
[13:26:20] (03CR) 10Ottomata: [C: 031] Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:29:05] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:32:39] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Ottomata) @HaeB do you still need this? Can we roll this back? [13:38:18] (03CR) 10Elukey: [C: 032] Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:40:10] hey all, I'm deploying the Proton on production to the https://gerrit.wikimedia.org/r/#/c/mediawiki/services/chromium-render/deploy/+/464558/ [13:40:24] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10akosiaris) 05Resolved>03Open I followed http://erikimh.com/megacli-cheatsheet/ to do so and ``` megacli -PdReplaceMissing -PhysDrv [15:9] -Array0 -row9 -a0... [13:41:25] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[python-sklearn] [13:41:45] fixing --^ [13:42:16] (03PS1) 10Ottomata: Add Accept header to varnishkafka webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) [13:43:39] !log pmiazga@deploy1001 Started deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:46] T201158: [4hrs] Have a Grafana dashboard for Proton - https://phabricator.wikimedia.org/T201158 [13:43:46] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [13:44:38] (03CR) 10Bstorm: "Going to get the initial actor table patch deployed fully before I merge this. Also, I'll test it locally." [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [13:46:34] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:46:35] !log pmiazga@deploy1001 Finished deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) (duration: 02m 55s) [13:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:21] PROBLEM - Etcd replication lag on conf1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.002 second response time [13:48:45] <_joe_> uhm [13:48:51] <_joe_> elukey: didn't you disable notifications? [13:48:55] what's up? 
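The proton deploy logged at 13:43 above is a standard scap3 service deploy. A sketch of what that usually involves on the deployment host; the repo path follows the common /srv/deployment convention and is an assumption here:

    # on deploy1001, from the service's deploy repository
    cd /srv/deployment/proton/deploy
    git pull && git submodule update --init --recursive
    # scap deploy pushes the new revision to the configured targets, restarts the
    # service, and writes the Started/Finished lines to the SAL as seen above
    scap deploy 'Bugfix: handle undefined response and fix grafana stats (T186748, T201158)'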
[13:49:02] yeah but for an hour, it must have expired, my bad [13:49:04] <_joe_> anyways, everyone, we're installing that [13:49:08] <_joe_> server [13:49:15] thanks for the heads up [13:49:48] added downtime for 4 hours [13:49:53] ah ok, so no worries :) [13:54:49] (03PS1) 10Elukey: profile::analytics::database::meta: fix require for stretch [puppet] - 10https://gerrit.wikimedia.org/r/464568 (https://phabricator.wikimedia.org/T205509) [13:55:08] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) [13:55:40] (03CR) 10Elukey: [C: 032] profile::analytics::database::meta: fix require for stretch [puppet] - 10https://gerrit.wikimedia.org/r/464568 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:59:14] (03PS1) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notification for check_elasticsearch_shard_size [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [14:00:08] (03CR) 10jerkins-bot: [V: 04-1] icinga::monitor::elasticsearch: throttle alerts notification for check_elasticsearch_shard_size [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:01:29] (03CR) 10Mathew.onipe: "@Filippo: the aim is to retry after 6 hours thrice before it finally throws an alert. Please confirm if this CR takes care of this." [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:03:24] (03PS2) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [14:03:30] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10kostajh) @jcrespo yes, #growth-team is handling #structureddiscussions. > Not doing this may soon block T106386... [14:06:38] (03CR) 10Ottomata: "Tested in deployment-prep, works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) (owner: 10Ottomata) [14:09:42] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 [14:09:50] !log Sanitize enwikivoyage cebwiki shwiki srwiki mgwiktionary on db1124:3315 T184805 [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [14:15:16] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:16:28] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [14:16:29] RECOVERY - Etcd replication lag on conf1005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.002 second response time [14:16:53] eyeroll [14:17:06] (03CR) 10Gehel: "Minor comment inline. I'd like Filippo to go over this to validate this does what I think it does (I'm sometimes confused by Icinga)." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:17:44] (03PS13) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [14:19:32] (03CR) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:29:31] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:29:31] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10greg) >>! In T205981#4641753, @Gehel wrote: > @greg it looks like we need your approval to a... [14:30:05] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) @kostajh I don't have a say on that, was just pointing we are waiting for someone to take a lead, and o... [14:34:19] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10DarTar) Approved, thanks. [14:34:36] (03PS2) 10Marostegui: wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:39:53] (03PS3) 10Alexandros Kosiaris: mathoid: Add nominal resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [14:39:55] (03PS2) 10Alexandros Kosiaris: mathoid: Switch liveness probe into tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464504 [14:39:57] (03PS2) 10Alexandros Kosiaris: Set the scaffolding's livenessProbe to tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464505 [14:45:31] 10Operations, 10cloud-services-team: WMCS: Fewer transitory middle-of-the-night puppet alerts - https://phabricator.wikimedia.org/T206224 (10Andrew) [14:47:58] (03PS1) 10Alexandros Kosiaris: scaffold: Add some sample requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464578 [14:48:00] (03PS1) 10Alexandros Kosiaris: mathoid: Bump num_workers to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/464579 [14:48:02] (03PS1) 10Alexandros Kosiaris: mathoid: Bump chart version to 0.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/464580 [14:48:04] (03PS2) 10Dduvall: Use sed instead of envsubst [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 [14:48:51] !log depooling labsb1010 (T195747) [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:57] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [14:49:23] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:49:30] (03CR) 10Dduvall: "Thought I'd resurrect this patchset one more time." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [14:49:36] (03CR) 10Alexandros Kosiaris: [C: 032] ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:49:44] (03PS2) 10Alexandros Kosiaris: ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:49:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:50:20] marostegui: merging yours as well [14:50:27] akosiaris: mine? [14:50:38] Marostegui: wiki replicas: depool labsdb1010 to add initial actor table changes to views (cd7d11227b) [14:50:40] ? [14:50:44] good thing I pinged [14:50:47] (03CR) 10Gehel: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:50:55] akosiaris: that is banyek [14:51:06] @akosaris shall I merge yours too? [14:51:14] akosaris: shall I merge yours too? [14:51:18] lol [14:51:22] yes [14:51:30] hm [14:51:39] so we list the uploader but not the committer there [14:51:47] akosiaris: I did the last rebase indeed [14:51:49] and on purpose now that I think about it [14:52:07] akosiaris: It is fine, brooke sent the patch, i rebased it and banyek +2 it [14:52:10] XD [14:52:12] yeah I saw [14:52:30] I was just wondering why I saw a different username but ok [14:52:44] I do even remember why we did it that way and not the other way around [14:53:42] (03PS2) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [14:54:00] (03PS2) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [14:54:19] (03PS2) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [14:55:25] (03PS4) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [14:55:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) Let's please update docs: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest [14:58:51] (03PS1) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 [14:59:39] (03CR) 10jerkins-bot: [V: 04-1] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:01:27] (03CR) 10Dzahn: [C: 031] "took the generated config from compiler, copied it to mwdebug1001 to replace wiktionary.conf.
ran apache-fast-test from deploy1001 with th" [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:01:58] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Nuria) +1 to analytics-admins [15:04:54] (03CR) 10Vgutierrez: [C: 04-1] "careful with the regex, lvs20010 != lvs2010 :)" [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:08:25] Hi, FYI, we're going to start the asw2-b-eqiad recabling work in ~1h, see https://phabricator.wikimedia.org/T201039 for the list of hosts impacted (and the email sent to ops@) [15:09:55] ack thanks :) [15:13:08] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) The disk was a spare...i didn't even look to see that it was a SATA disk. This server is out of warranty and we'll need to buy 4TB SAS disks [15:13:49] (03CR) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [15:18:44] (03CR) 10Gehel: [C: 031] "LGTM, I will check with Stas before merging." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [15:19:40] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) Maybe it makes sense to prioritize T196478 instead? [15:20:10] (03CR) 10C. Scott Ananian: "The dependency went out in 1.32.0-wmf.23 last week, and so should be safe to merge today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [15:20:21] (03PS5) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [15:20:50] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:20:59] (03CR) 10Vgutierrez: [C: 032] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:23:22] (03Merged) 10jenkins-bot: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:25:36] (03PS2) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 [15:25:41] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10akosiaris) >>! In T205364#4642331, @Dzahn wrote: > Maybe it makes sense to prioritize T196478 instead? That's what we've been doing up to now more or less. But it doesn't look good either...
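The -1 on the netboot.cfg change above, "careful with the regex, lvs20010 != lvs2010", is the classic unanchored-pattern trap: a pattern meant for a short hostname also matches any longer hostname that merely starts with it. A quick illustration with grep, not the actual netboot.cfg syntax:

    printf 'lvs2007\nlvs2008\nlvs20070\n' | grep -E 'lvs200[78]'
    # matches all three lines, including the unintended lvs20070
    printf 'lvs2007\nlvs2008\nlvs20070\n' | grep -E '^lvs200[78]$'
    # anchoring with ^ and $ keeps it to lvs2007 and lvs2008 only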
[15:26:13] (03CR) 10jenkins-bot: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:26:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) 05Open>03stalled [15:26:22] (03CR) 10jerkins-bot: [V: 04-1] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:26:28] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) [15:26:41] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) Since this relates to an access request that was already approved at this weeks SRE... [15:27:06] (03PS2) 10Herron: admin: add Matt(onimisionipe) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/463967 (https://phabricator.wikimedia.org/T205981) (owner: 10Mathew.onipe) [15:27:44] (03PS3) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [15:27:53] (03CR) 10Cwhite: [C: 032] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:28:03] (03CR) 10Herron: [C: 032] admin: add Matt(onimisionipe) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/463967 (https://phabricator.wikimedia.org/T205981) (owner: 10Mathew.onipe) [15:28:47] (03CR) 10Vgutierrez: "Cool! add the missing space between Bug: and T196560 in the commit message and it's ready to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:32:16] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) 05Open>03Resolved a:03herron Change has been merged and will propagate out acr... 
[15:34:10] (03PS3) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [15:34:27] (03PS4) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [15:34:54] (03PS5) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [15:35:19] (03PS2) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) [15:35:40] (03PS3) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [15:36:53] (03PS3) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [15:38:21] (03CR) 10Volans: [C: 031] "LGTM, it might even work at first try :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [15:39:18] (03CR) 10Herron: ircecho: restart service on change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [15:41:46] !log depool kafka1002 from eventbus as precautionary step for T201039 [15:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [15:43:46] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:58] (03PS3) 10Papaul: iPartman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) [15:49:18] ^ shdubsh: caused by your patch: [15:49:32] Error while evaluating a Resource Statement, Duplicate declaration: Package[diamond] is already declared in file /etc/puppet/modules/standard/manifests/diamond.pp:23; cannot redeclare at /etc/puppet/modules/diamond/manifests/init.pp:69 at /etc/puppet/modules/diamond/manifests/init.pp:69:5 at /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp:32 on node etherpad1001.eqiad.wmnet [15:49:35] (03PS1) 10Cwhite: standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) [15:49:47] papaul: I think you accidentally added an "i" at the beginning of the commit message :( [15:49:51] (03PS3) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [15:50:04] (03CR) 10Volans: [C: 031] "LGTM, thanks for taking care of this." [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [15:50:19] moritzm: indeed. 
hoping the latest patch will alleviate [15:51:02] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444 [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:06] T199444: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 [15:52:42] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444 (duration: 01m 40s) [15:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:41] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444 [15:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:36] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444 (duration: 00m 55s) [15:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] ACKNOWLEDGEMENT - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues cole_white https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464595/ [15:55:41] jouncebot: next [15:55:42] In 0 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1600) [15:55:47] shdubsh: that would probably end up in a whack-a-mole of various occurrences of diamond::collector->absent, we probably need to look into a different fix [15:56:23] !log icinga downtime every server with the cloudXXXX scheme for 2h T201039 [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:27] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [15:58:33] !log icinga downtime every server in the main cloudvps deployment for 2h T201039 [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1600). Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:11] o/ [16:00:12] !log Stop MySQL on db1073 for mariadb and kernel upgrade - T201039 T148507 [16:00:15] arturo: ^ [16:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:21] ack [16:00:52] kernel upgrade? I thought this was just about the recabling? [16:01:09] arturo: we are taking the window also to upgrade mysql, move the socket to the correct path and upgrade the kernel [16:01:13] (and upgrade mysql) [16:01:27] cool [16:02:57] arturo: server rebooting now [16:03:00] (03PS4) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) [16:03:13] it is important to have those kernels fresh, arturo :-) [16:03:24] it is when they taste better! [16:03:24] sure [16:03:46] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
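The db1073 maintenance interleaved above follows a common pattern: stop MySQL, upgrade, reboot, restart, then reload the proxy that fronts it. A rough sketch, assuming stock Debian service names; the real hosts use WMF-specific packages and runbooks that are not reproduced here:

    # on db1073
    sudo systemctl stop mariadb
    sudo apt-get update && sudo apt-get -y upgrade   # pick up new mariadb and kernel packages
    sudo reboot
    # after the reboot
    sudo systemctl start mariadb
    mysql -e "SELECT @@version, @@socket"            # confirm version and the new socket path
    # on the proxy in front of it (dbproxy1005 in this case)
    sudo systemctl reload haproxy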
[16:03:55] 10Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (10Joe) The master DC is a variable, and while in production that's dynamically generated from etcd (more or less), in the compiler is a static value. That was a deliberate choice to decouple... [16:04:07] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:04:14] wikitech is broke'd https://wikitech.wikimedia.org/ [16:04:16] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [16:04:32] (03CR) 10Vgutierrez: [C: 032] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [16:04:37] Cannot access the database: Cannot access the database: Unknown error [16:04:45] AndyRussG: that's expected [16:04:46] (03PS5) 10Vgutierrez: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [16:04:51] AndyRussG: yes, there is maintenance going on [16:05:12] (03PS4) 10Cwhite: Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:05:33] arturo: server up, starting mysql now [16:05:50] ack [16:05:52] just connected indeed [16:06:22] marostegui: arturo ah okok thx! have fun :) [16:06:38] arturo: everything is up [16:06:41] we should be back [16:06:41] are the deploys done? (dunno how I can check it) [16:06:45] socket also in the new location! [16:06:49] marostegui: do I reload the proxy? [16:06:56] yes please [16:07:09] !log reloading haproxy @ dbproxy1005 [16:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:13] XioNoX: there is nowhere to check really except asking in here like you did [16:07:22] I can edit wikitech finely [16:07:25] mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP [16:07:26] AndyRussG: ^ [16:08:07] arturo: let me know if you see all good from your end [16:08:07] !log logged downtime for phabricator in icinga, stopped phd queue processing in preparation for read-only mode [16:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:49] marostegui: apparently yes. I see openstack talking to the DB [16:09:10] great [16:09:24] marostegui: yeah looks good now :) [16:09:30] thx! [16:09:41] good to hear, thanks! [16:11:10] thcipriani: is the deployment done? [16:11:40] I understand no physical cable was unplugged yet, right XioNoX ? [16:11:50] arturo: correct [16:11:52] deployment? I wasn't deploying. [16:11:54] (03PS4) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [16:12:04] puppet swat [16:12:16] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [16:12:17] oh! no, nobody did puppet SWAT: should be a simple one. [16:12:19] (03PS1) 10Jcrespo: Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 [16:12:28] was mentioning "thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker." [16:12:43] let's do it after the network maintenance :) [16:12:54] ok then [16:13:20] (03CR) 10Marostegui: [C: 031] "+10000!" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:23] XioNoX: sorry, I misunderstood you. "deployments" :) [16:13:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:40] (03CR) 10Jcrespo: "With https://phabricator.wikimedia.org/T148507 closed, mariadb is no longer a blocker... although I cannot be sure for other services." [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:51] no worries, I didn't know how those things work [16:13:57] !log starting asw2-b-eqiad re-cabling - T201039 [16:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:02] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:14:21] !log Enable all VC ports on FPC2 and FPC7 - T201039 [16:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:51] should I go read-only now? [16:15:03] (03CR) 10Muehlenhoff: [C: 031] "Looks fine. We don't collect comparable metrics for timesyncd as we did for ISC ntpd (as timesyncd is far more minimalistic) and the ensur" [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:16:02] !log phabricator is read-only [16:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:24] !log Stop and reboot db1072 (phabricator master) for maintenance [16:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:41] tarrow: marostegui thanks [16:16:53] sorry^ I meant twentyafterfour [16:17:00] And not me? :_( [16:17:01] :-) [16:17:06] xd [16:17:12] phab ded? D: [16:17:23] MatmaRex: see SAL [16:17:55] A Troublesome Encounter! [16:17:55] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [16:18:11] oh, sometimes it loads, readonly indeed [16:18:18] MatmaRex: we're in read-only mode but some requests will still error out [16:18:42] (03CR) 10Herron: [C: 032] "you betcha!" 
[puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [16:19:56] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:20:28] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:20:31] ^expected [16:21:45] !log reloading dbproxy1003,8 [16:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:57] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [16:22:07] it says up on both [16:22:17] as the recovery testifys also [16:22:36] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [16:24:27] (03CR) 10Aaron Schulz: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [16:25:08] !log phabricator is read-write [16:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:32] MatmaRex: note maintenance window has not finished and there may be more interruptions [16:27:26] (03CR) 10Cwhite: [C: 032] standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:27:33] (03PS2) 10Cwhite: standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) [16:27:37] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [16:27:47] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [16:30:15] (03PS1) 10Giuseppe Lavagetto: Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 [16:31:29] (03CR) 10Elukey: [C: 031] Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 (owner: 10Giuseppe Lavagetto) [16:33:32] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356 (10Marostegui) [16:33:35] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [16:33:46] !log started phd on phab1001 and re-enabled puppet (I had it disabled to prevent starting phd during read-only) [16:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:49] 10Operations, 10IRCecho, 10Patch-For-Review: Puppet doesn't restart ircecho when the code changes - https://phabricator.wikimedia.org/T205539 (10herron) 05Open>03Resolved a:03herron The above patch was merged (not sure why gerritbot didn't comment about that) Resolving! [16:34:16] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:35:28] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10EBjune) > @EBjune as @Mathew.onipe manager, could you approve this request? Approved, thanks! 
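The "mariadb,db1073,0,0,...,UP" line pasted above is haproxy's CSV stats output, which is also what the check_failover Icinga check reads. Checking backend state and reloading a dbproxy by hand looks roughly like this; the admin socket path is the Debian default and may differ on these hosts:

    # column 1 = backend, 2 = server, 18 = status (UP/DOWN)
    echo "show stat" | sudo socat stdio /run/haproxy/admin.sock | cut -d, -f1,2,18
    # validate the config, then reload gracefully so existing connections survive
    sudo haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy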
[16:35:46] PROBLEM - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:36:46] etherpad? [16:37:14] probably something else in there [16:37:46] diamond not found [16:37:50] etherpad1001 puppet-agent[26407]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Package[diamond] is already declared in file /etc/puppet/modules/standard/manifests/diamond.pp:23; cannot redeclare at [16:37:50] /etc/puppet/modules/diamond/manifests/init.pp:69 at /etc/puppet/modules/diamond/manifests/init.pp:69:5 at /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp:32 on node etherpad1001.eqiad.wmnet [16:38:26] was something deployed recently? [16:38:35] shdubsh: think that’s related to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464595/ ? [16:38:41] oh, I see [16:38:45] duplicate require [16:38:51] and not require_package [16:39:19] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Adding also @aaron to get his opinion, no idea about how to trace back what... [16:39:42] !log Enable fpc5-fpc7 - T201039 [16:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:47] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:39:48] (03PS5) 10Cwhite: Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:41:02] herron, jynus: it should be recovered now [16:41:22] !log Connect/enable fpc2:0/51-fpc5:1/0 (5m DAC) - T201039 [16:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:29] alrighty, indeed looks like most recent puppet run was happy. thanks shdubsh ! [16:42:25] diamond may need a push, however [16:42:48] reset-failed or something (don't understand the context 100%) [16:43:07] ACKNOWLEDGEMENT - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. cole_white diamond is removed. looking into removing from systemd [16:45:44] !log etherpad1001 running systemctl reset-failed [16:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] RECOVERY - Check systemd state on etherpad1001 is OK: OK - running: The system is fully operational [16:47:35] (03CR) 10Cwhite: "This latest changeset looks more happy. 
https://puppet-compiler.wmflabs.org/compiler1002/12776/" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:52:00] !log Step 3) Add missing links - T201039 [16:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:53:50] (03CR) 10Cwhite: [C: 031] Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:54:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:58:10] (03CR) 10Volans: "See inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [16:58:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:59:26] PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: connect to address checker.tools.wmflabs.org and port 80: No route to host [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1700). [17:00:36] RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.012 second response time [17:00:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:01:24] (03PS1) 10Herron: admin: add isaacj to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) [17:02:28] !log tools - published updated toollabs-* Docker images [17:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:15] PROBLEM - Host analytics1061 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:16] PROBLEM - Host analytics1063 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:30] hello [17:03:35] PROBLEM - Host wtp1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:36] everything ok? 
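The etherpad1001 cleanup logged a little earlier (16:45, "running systemctl reset-failed") is the standard way to clear a leftover failed unit once its package has been removed; until then the "Check systemd state" alert stays CRITICAL because the system reports itself as degraded. A short sketch:

    systemctl is-system-running        # reports "degraded" while any unit is failed
    systemctl --failed                 # shows the leftover unit, e.g. the removed diamond service
    sudo systemctl reset-failed        # or reset-failed <unit> to clear just one
    systemctl is-system-running        # back to "running" once nothing is failed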
[17:03:45] network maintenance on row b [17:03:46] PROBLEM - Host an-master1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:48] ah of course [17:03:50] network maintenance [17:04:06] PROBLEM - Host mwmaint1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host ores1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host notebook1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host restbase-dev1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] PROBLEM - Host ms-be1041 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] PROBLEM - Host mw1313 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] they may be back [17:04:26] PROBLEM - Host mw1318 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:26] PROBLEM - Host mc1025 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host scb1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host mw1290 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host thumbor1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host elastic1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host mw1286 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host db1119 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host db1113 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:28] RECOVERY - Host mw1313 is UP: PING WARNING - Packet loss = 44%, RTA = 2.71 ms [17:04:28] RECOVERY - Host restbase-dev1005 is UP: PING WARNING - Packet loss = 44%, RTA = 1.16 ms [17:04:36] PROBLEM - Host elastic1038 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host kafka1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host mw1287 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host elastic1039 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] RECOVERY - Host ms-be1041 is UP: PING WARNING - Packet loss = 44%, RTA = 0.26 ms [17:04:41] XioNoX: ---^ [17:04:47] he knows [17:04:48] RECOVERY - Host elastic1036 is UP: PING WARNING - Packet loss = 80%, RTA = 0.35 ms [17:04:53] okok [17:04:55] RECOVERY - Host mw1318 is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 58%, RTA = 0.58 ms [17:04:56] RECOVERY - Host scb1002 is UP: PING WARNING - Packet loss = 28%, RTA = 0.36 ms [17:04:56] RECOVERY - Host mw1287 is UP: PING WARNING - Packet loss = 28%, RTA = 0.34 ms [17:04:56] RECOVERY - Host thumbor1001 is UP: PING WARNING - Packet loss = 37%, RTA = 4.57 ms [17:04:56] RECOVERY - Host mw1286 is UP: PING WARNING - Packet loss = 28%, RTA = 0.31 ms [17:04:56] RECOVERY - Host elastic1038 is UP: PING WARNING - Packet loss = 37%, RTA = 0.31 ms [17:04:56] RECOVERY - Host elastic1039 is UP: PING WARNING - Packet loss = 37%, RTA = 0.40 ms [17:04:56] RECOVERY - Host kafka1002 is UP: PING WARNING - Packet loss = 37%, RTA = 0.36 ms [17:04:57] RECOVERY - Host mc1025 is UP: PING WARNING - Packet loss = 37%, RTA = 0.38 ms [17:04:57] RECOVERY - Host mw1290 is UP: PING WARNING - Packet loss = 28%, RTA = 0.89 ms [17:05:06] PROBLEM - HHVM rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:06] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:06] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:15] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed [17:05:15] nse was received [17:05:16] PROBLEM - SSH on elastic1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:16] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:16] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:17] PROBLEM - Nginx local proxy to jobrunner on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:25] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:25] PROBLEM - Nginx local proxy to jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:25] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:26] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:26] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:27] PROBLEM - SSH on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:35] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:35] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for 
January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:37] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:38] PROBLEM - Host mw1314 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:52] is this useful to anyone or shall I disable ircecho in favor of unhandld issues dashboard in icinga [17:05:55] PROBLEM - Host mw1304 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:56] PROBLEM - Host mw1302 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:56] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=tr [17:05:56] re a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was [17:05:56] /v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [17:05:57] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:06:05] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:05] PROBLEM - Host mc1027 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:05] PROBLEM - Host mw1285 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:05] PROBLEM - Host mw1301 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:06] PROBLEM - Host cp1081 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:06] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:06] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:15] PROBLEM - Host mw1296 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:15] PROBLEM - Host mw1288 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Host mw1306 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Host thumbor1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:06:16] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:17] PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:17] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:17] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:18] PROBLEM - Host cp1082 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:25] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:26] PROBLEM - Host wtp1033 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:27] PROBLEM - Host druid1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:35] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:06:37] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server 
Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) [17:06:38] are mobileapps alerts supposed to be going off? [17:06:49] and in codfw? [17:06:50] !log stop ircecho on einstenium - alarms shower [17:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:58] everything talks to everything, often in mysterious ways [17:08:21] :S [17:08:23] I'd assume the codfw mobileapps is related to the row B network stuff in eqiad, but it might be hard to track down the causal change [17:08:34] s/change/chain/ [17:09:50] so it might be related to aqs having troubles [17:10:11] (the mobile apps alarms) [17:13:09] 08Warning Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Inbound interface errors [17:13:35] yeah I think it is the chain mobileapps -> aqs -> druid [17:15:46] <_joe_> elukey: why does mobileapps read from aqs? [17:15:51] !log triggering some alerts on labvirt1018 to figure out about alert thresholds [17:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:55] <_joe_> if that's the case, aqs needs to be multi-dc [17:16:29] could be worse, there's only 3 services in that chain :) [17:16:42] <_joe_> can I ask if we know when the maintenance will be over? [17:16:48] we don't yet know [17:16:51] <_joe_> or if there is any update? [17:17:10] the low-level traffic on the maintenance is in -dcops, aside from !log entries here [17:17:26] there's still cabling work ongoing [17:17:46] _joe_ I think it is for some metrics, but I have no control on the clients of course [17:17:48] should be soon if things are stable [17:17:54] I recall that we had as similar problem a while ago [17:18:14] <_joe_> XioNoX: ok, thanks [17:18:33] the original window statement from the ticket was: [17:18:36] the new asw2-b-eqiad that will be impacted by Thursday 4th 16:00UTC 2h maintenance window (with a worse case of a 30min downtime for those hosts, and a best case of no impact). [17:19:29] so from that pov, we've got ~41 minutes left on the maint window, and so far affected hosts have been impacted for 16 of the 30 mins [17:19:42] (03PS2) 10Giuseppe Lavagetto: Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 [17:19:44] (03PS1) 10Giuseppe Lavagetto: New upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464608 [17:20:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] New upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464608 (owner: 10Giuseppe Lavagetto) [17:21:13] troubleshooting one last link not coming up [17:21:51] we still have FPC6 in some disconnected state, that might not be related to the bad 2-8 link right? [17:22:00] bblack: do we? [17:22:12] !log re-enable ircecho after alarms shower [17:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:15] I think FPC6 looks healthy [17:22:23] I assume since there were host down alerts above that never recovered yet [17:22:42] oh, icinga-wm floodquit or whatever, so irc log isn't reliable [17:23:19] yeah, the FPC disconnect were brief [17:23:37] bblack: I stopped ircecho to avoid the shower of alarms and be able to talk in here [17:24:10] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Inbound interface errors [17:24:55] <_joe_> elukey: bad idea tbh [17:24:56] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:25:08] <_joe_> it made me think the mc servers were down for half an hour [17:25:50] _joe_ I will not do it again, but to me it is pointless to keep seeing alerts in here and not have a place to talk [17:25:56] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 35, unassigned_shards: 1268, number_of_pending_tasks: 57, number_of_in_flight_fetch: 35, timed_out: False, active_primary_shards: 3178, task_max_waiting_in_queue_millis: 1308459, cluster_name: production-search-eqiad, relocating_shards: 3, active_shards_percent_as_ [17:25:56] we have the icinga ui to check [17:25:56] 58, active_shards: 8241, initializing_shards: 39, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [17:26:07] <_joe_> I get it, but I didn't notice it [17:26:14] sure sure [17:26:16] <_joe_> since we usually don't do it [17:26:39] I had a different worflow/impression (already did it other times), will not do it again :) [17:27:24] <_joe_> you can do it, but maybe do it *before* the shower of alerts happen [17:27:30] well the not-really-followed plan was to use this channel more for logs + spam, and move true conversation elsewhere, but it hasn't really materialized in practice [17:27:37] <_joe_> or people reading the backlog will have a heart attack :P [17:28:06] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:28:12] is this a bad time for me to be deploying parsoid? [17:28:25] RECOVERY - AQS root url on aqs1008 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.018 second response time [17:28:34] <_joe_> arlolra: I fear it is, there is a network mainentenance ongoing in eqiad [17:28:46] ok, will postpone [17:28:47] thanks [17:28:49] <_joe_> your deploy might fail in unpredictable, not-so-funny ways [17:28:53] I think we have light at the end of the tunnel, but not ready to declare it all-ok just yet [17:29:11] we *think* the network has been stable now for a while, and we think there's no more physical changes to make right now [17:29:18] arlolra, ok .. i guess week after switchover. [17:29:24] <_joe_> arlolra: if you can wait some minutes, maybe bblack & co might give you a green light [17:29:50] it's not urgent, better to wait [17:29:52] is anything outstanding in term of row B alerts? [17:30:50] some puppet errors that don't let see any real issue [17:30:54] nothing horrible that I can see in icinga [17:30:56] RECOVERY - puppet last run on db1113 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:56] PROBLEM - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:31:01] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad Banyek ACK https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:32:04] lvs1020: "Servers phab1001-vcs.eqiad.wmnet are marked down but pooled" does that need any action? 
[17:32:25] <_joe_> lvs1002 [17:32:35] that one [17:32:37] I was going to say, woah that's way more lvses than I remember :) [17:32:41] <_joe_> abd tes [17:32:50] <_joe_> *and yes, that's a real alert [17:32:56] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:33:08] <_joe_> ipv6 to port 22 there seems unreachable from lvs1002 [17:33:12] <_joe_> but not 1005 [17:33:45] <_joe_> bblack: should we talk in -dcops? [17:34:33] !log pool kafka1002 (eventbus) after maintenance [17:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:25] (03PS39) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [17:35:27] (03PS10) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [17:35:56] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:56] RECOVERY - puppet last run on mc1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:56] RECOVERY - puppet last run on kubestage1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:56] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:57] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:57] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:58] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:36:05] (03CR) 10Volans: "See inline, also there are a bunch of comments to a previous PS that are still valid and un-answered." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:37:04] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:37:49] alright, the network work is done [17:37:58] let me know if there is any outstanding issue [17:38:17] but the stack has been stable for a bit of time now [17:38:17] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:38:44] !log asw2-b-eqiad recabling done - T201039 [17:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:48] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [17:38:56] XioNoX: did you see _joe_'s comment? 
[17:39:06] not sure if that is network or traffic issue [17:39:07] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 [17:39:27] yeah I'm not sure either [17:39:50] but it seems odd it would be a network error and affect phab1001-vcs and not phab1001 [17:40:22] what's the source/dest of the issue? lvs1002 to ? [17:40:23] hope you can handle it, I was going to disconnect [17:40:38] yes [17:40:38] twentyafterfour may be around althoug that doesn't look service related [17:40:46] XioNoX: lvs1002 <-> phab1001-vcs [17:40:56] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:56] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:40:57] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:41:08] <_joe_> !log uploaded new python-etcd packages for jessie, stretch [17:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:25] phab1001 is row B [17:41:39] but the same machine + same interface is both the phab1001.eqiad.wmnet and phab1001-vcs.eqiad.wmnet IPs [17:42:05] and I can ping both of those IPs from lvs1002 [17:43:17] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad Banyek T206245 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:45:27] (03PS14) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:45:57] (03CR) 10Aezell: [C: 031] Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [17:46:11] oh right, I was being confused by the ipv6 part [17:46:13] (03CR) 10Aezell: [C: 031] Remove old ArticleCreationWorkflows config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [17:46:19] can't ping phab1001 or phab1001-vcs ipv6 from lvs1002 [17:46:23] still looking at it [17:46:31] IPs are properly configured on both sides [17:46:36] (03CR) 10Gehel: [C: 032] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:47:02] !log repooling labsb1010 (T195747) [17:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:08] yes, and reachable from lvs1005 [17:47:09] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:48:29] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:48:53] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:49:07] that looks similar to the VC fabric "miss-programming" [17:49:08] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 
10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:49:16] that started https://phabricator.wikimedia.org/T201039 [17:49:18] so on the lvs1002 side of this [17:49:57] the ipv6 for private1-b is 2620:0:861:102:1a03:73ff:fef0:8ede on eth1.1018@eth1 with mac 18:03:73:f0:8e:de [17:50:15] phab1001 seems like it has ok ipv6 to other places, the error may be with the lvs1002 row b interface [17:50:24] hard to say though [17:50:44] PROBLEM - Etcd replication lag on conf1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.002 second response time [17:50:52] phab1001 neighbor table has a stale entry: [17:50:57] fe80::1a03:73ff:fef0:8ede dev eth0 lladdr 18:03:73:f0:8e:de STALE [17:51:02] ohrilly [17:51:05] for the fe80 [17:51:11] (03PS1) 10Gehel: Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 [17:51:28] oh and this: [17:51:30] 2620:0:861:102:1a03:73ff:fef0:8ede dev eth0 FAILED [17:51:35] lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:100 eth1.1018 [17:51:35] Soliciting 2620:0:861:102:10:64:16:100 (2620:0:861:102:10:64:16:100) on eth1.1018... [17:51:35] Timed out. [17:51:37] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:45] elukey: do we need to worry about that alert or still no? [17:52:29] indeed, the etcd replication paged, known already? [17:52:31] (03CR) 10Gehel: [C: 032] Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 (owner: 10Gehel) [17:52:35] (03PS40) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [17:52:37] (03PS11) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [17:52:50] lvs1002 has like 8 different FAILED neighor entries [17:52:50] bblack: the fastest fix might be to bounce the network port of one of the servers [17:52:56] it paged earlier and the ack was extended [17:53:02] let's debug a little first since it isn't super critical [17:53:10] maybe there's a software level solution to this, too [17:53:16] (03PS2) 10Gehel: Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 [17:53:20] the server was being installed [17:53:25] maybe it is not yet installed? [17:53:41] (03CR) 10Gehel: [V: 032 C: 032] Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 (owner: 10Gehel) [17:53:46] I'm guessing we don't have an easy way to target all hardware in row B via cumin right? [17:54:25] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:54:27] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.000 second response time [17:54:28] we should, lldp neighbors are a puppet fact IIRC [17:54:35] bblack: almost [17:54:37] give me a sec [17:55:27] bblack: do you have a command I should run? 
[17:55:58] volans: I'm slogging through now, just targetting everything (which is probably better anyways) [17:56:13] RECOVERY - Etcd replication lag on conf1005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.002 second response time [17:56:38] I cheated btw, I used netbox to select all rowB, export the result (is a csv) selected the column with hostname [17:56:43] and was ready to use that as a target list [17:56:44] I'm not even sure what FAILED means in the ipv6 neighbor table, but I wonder if it's a state that can happen due to a flap, but then gets stuck on the Linux side and just needs clearing [17:56:56] `phab1001:~$ sudo tcpdump -p "host 2620:0:861:102:10:64:16:100 or host fe80::1a03:73ff:fef0:8ede"` [17:56:56] `lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:100 eth1.1018` [17:56:56] bblack ^ phab1001 doesn't see the ND request [17:57:05] <_joe_> ok I am inclined to disable notifications from conf1005 [17:57:15] ah there you are [17:57:32] I was just about to speculate wildly about that host [17:57:34] _joe_: +1 for me if you'll re-enable them later [17:58:28] PROBLEM - puppet last run on labvirt1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:59:04] bblack: in case you need it, this is the search on Netbox (servers in rowB eqiad): [17:59:07] https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=6&role=server [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1800). [18:00:05] stephanebisson, jynus, and cscott: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:54] PROBLEM - High CPU load on API appserver on mw2204 is CRITICAL: CRITICAL - load average: 89.84, 36.05, 20.00 [18:00:57] I'm still staring/debugging on the phab1001-vcs issue [18:01:03] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:01:04] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hive_schematool_initialize_schema] [18:01:54] RECOVERY - High CPU load on API appserver on mw2204 is OK: OK - load average: 33.09, 29.82, 18.93 [18:01:58] Hi [18:03:35] RECOVERY - puppet last run on labvirt1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:05:56] XioNoX: it half-recovered now.... [18:06:37] state INCOMPLETE now? [18:06:41] well [18:06:56] I can do :8 now, which is the phab1001 ipv65 [18:06:58] err [18:06:59] I can do :8 now, which is the phab1001 ipv6 [18:07:11] but can't ping :100 which is the phab1001-vcs ipv6 [18:07:16] it's all the same interfaces [18:07:34] before, ipv4 was working but ipv6 wasn't. Now ipv4 and 1/2 ipv6 addrs work, but other ipv6 fails [18:07:38] <_joe_> !log disabled notifications for etcd replication lag on conf1005, not in production [18:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:50] is the juniper buggy port thing related to cam tables or whatever juniper calls them? 
[18:09:01] (switching macaddr tables) [18:09:13] bblack: 2620:0:861:102:10:64:16:100 dev eth1.1018 lladdr 14:18:77:5b:23:4c DELAY [18:09:30] that's on lvs1002? [18:09:42] bblack: so pinging lvs from phab1001-vcs IP, forces lvs to update its cache [18:09:46] neighbor table [18:09:51] I did that though [18:10:16] root@phab1001:~# ping6 -n -I 2620:0:861:102:10:64:16:100 2620:0:861:102:1a03:73ff:fef0:8ede [18:10:25] but if lvs can't send the ND broadcast, then it will expire afer some time [18:10:26] ^ was my ping from the phab1001-vcs IP to lvs1002 [18:10:31] yeah that's why I did too [18:10:32] I ran that many times [18:10:34] what* [18:10:35] ah [18:10:44] bblack, XioNoX: Is the thing you are working on blocking SWAT? [18:10:49] stephanebisson: no [18:11:26] XioNoX: I'm inclined to think this particular issue (and maybe others like it?) are more of a bad linux software reaction to a network blip than anything else [18:11:28] bblack: and now on lvs1002 "2620:0:861:102:10:64:16:100 dev eth1.1018 lladdr 14:18:77:5b:23:4c REACHABLE" [18:11:34] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [18:11:38] fair enough [18:11:39] (bad reaction in the sense of bad neighbor table states, etc) [18:11:53] I guess I'll do the SWAT [18:12:02] in any case, it did just eventually get fixed with no switch-side work, so that says a lot [18:12:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:12:29] bouncing the switch port might've fixed it too, but probably indirectly by flushing out everything related on the host's software interface stuff [18:13:03] bblack: I'm still confused on why ND doesn't work for that host [18:13:16] yeah that's what I'm saying [18:13:41] (03PS2) 10Sbisson: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) [18:14:07] I don't know why ND didn't work, but (perhaps a race condition / bug somewhere) went wrong when parts of asw2-b were flapping, and one or both sides of this pair of host got into a confused state on their ipv6 reachability to each other. 
[18:14:22] but v4 was working the whole time [18:14:28] lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:8 eth1.1018 <- timeout [18:14:28] lvs1002:~$ ndisc6 2620:0:861:4:208:80:155:108 eth3.1004 <- works [18:14:29] so it's not like the interface wasn't passing eth packets [18:14:32] ND still doesn't work [18:14:36] afaik [18:14:39] for that host [18:15:27] either the switch is interfering specifically at that level (with v6 discovery traffic / port-switching stuff), or it's a host-side problem [18:15:41] but ethernet traffic does flow between these macaddrs, for ipv4 [18:16:03] PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:16:19] yeah, it's very specific to ND/multicast to at least some destination [18:16:23] PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:16:45] (03PS1) 10Bstorm: wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) [18:16:50] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:18:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:18:19] (03Merged) 10jenkins-bot: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:18:59] yeah [18:19:11] actually almost all ipv6 ND from lvs1002 -> row B seems bad [18:19:21] public vlan too [18:20:44] XioNoX: how about I stop pybal on lvs1002 (failover to 1005), and then we can try bouncing lvs1002:eth1? [18:20:51] bblack: sounds good [18:20:54] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled [18:21:12] !log lvs1002: puppet disabled, stopping pybal (fail to 1005) [18:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:50] going to try software first [18:21:54] ok [18:22:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:23:25] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:464510|Enable PageTriage/ORES on enwiki (T206149)]] (duration: 01m 01s) [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:30] T206149: Enable ORES in PageTriage in production - https://phabricator.wikimedia.org/T206149 [18:23:32] yeah no fix [18:23:37] XioNoX: try switch? 
[18:23:44] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:23:50] ok [18:24:03] PROBLEM - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [18:24:27] !log bounce lvs1002:eth1 switch port [18:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:28] (03PS41) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [18:24:30] (03PS12) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [18:24:34] cscott: Are you around? [18:25:17] bblack: back on [18:25:33] yeah no help [18:25:34] PROBLEM - PyBal connections to etcd on lvs1002 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=10) [18:25:53] (03CR) 10jenkins-bot: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:26:14] weird [18:27:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [18:27:15] XioNoX: we could try 1001 if we keep the flap quick? [18:27:22] (err, phab1001) [18:27:36] I donno [18:27:56] I guess lots of integreations there for the deployment that's ongoing too though [18:28:01] cscott: I can deploy your patch if you become available in the next 15 minutes. Just let me know. [18:28:10] (03Abandoned) 10Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - 10https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [18:28:17] bblack: we can probably leave it as it for now [18:28:17] stephanebisson: i'm around! [18:28:28] stephanebisson: sorry, lost track of swat time [18:28:32] cscott: ok, let's do it [18:28:43] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:28:51] XioNoX: yeah I'm not sure as to the impact really [18:28:53] (03CR) 10Sbisson: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:28:55] minor to be sure [18:28:59] (03PS6) 10Sbisson: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:29:07] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:30:21] it's going to impact anyone connection to git-ssh.wikimedia.org with v6, but maybe most would fall back to v4 (or even start there) [18:31:06] (03Merged) 10jenkins-bot: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:31:48] bblack: but it's currently working, right? [18:31:54] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zuul] [18:32:02] XioNoX: I doubt it! [18:32:06] but maybe? 
[18:32:44] oh, it is [18:32:52] cscott: You change is on mwdebug2001. Can you test? [18:32:58] your* [18:33:00] I guess, even with ND borked, it's still routing traffic based on the arp of the ipv4 [18:33:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:33:29] ok. can you remind me how to pin my request to a particular mwdebug server? [18:33:33] although I'm not sure I understand how [18:33:36] ipvsadm says: [18:33:38] TCP [2620:0:861:ed1a::3:16]:22 wrr -> [2620:0:861:102:10:64:16:100]:22 Route 10 0 0 [18:34:42] cscott: There's a browser extension called "X-Wikimedia-debug". When you click on it you can select the server (mwdebug2001) and turn it ON. [18:35:42] XioNoX: just file a task, note that public git-ssh over v4 + v6 still appear to be working, ack the alert for now. And we can try bouncing phab1001 port later when there's no deploy windows or anything else going on. [18:36:09] sounds good, what's the alert? [18:37:05] ah, found it [18:37:07] "Servers phab1001-vcs.eqiad.wmnet are marked down but pooled" [18:37:08] the 3x criticals showing on lvs1002 [18:37:26] err wait, those are mine [18:37:41] that's probably why this is working heh [18:37:42] yeah, from pybal [18:37:45] :) [18:38:05] will leave things disabled/stopped on 1002 for now then as part of it [18:38:43] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zuul] [18:40:27] stephanebisson: ok, confirmed that mwdebug2001 is still using Preprocessor_Hash in production (as it should be) [18:40:41] we don't have any machine running php7 yet in prod, do we? [18:41:37] (03CR) 10jenkins-bot: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:43:30] (03CR) 10Volans: [C: 031] "LGTM, let's just wait next week after the switchover to merge as we'll decom the remaining jessie hosts where this should go that would no" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [18:44:53] (03CR) 10C. Scott Ananian: "Deployed and tested on mwdebug2001:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:45:23] (03PS17) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) [18:46:10] (03PS8) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [18:46:42] cscott: re:php7, I don't know :( [18:46:52] cscott: No. [18:46:58] pretty sure we don't. [18:47:25] cscott: Well, yes, but not for user-facing duties. The dump runners are running PHP7 and so is at least one cron-based task (l10nupdate) [18:47:26] so i can't test that we use Preprocessor_DOM on PHP7 machines because there aren't any yet. so i think that's all good. [18:47:48] I have machines running php7 [18:47:53] but that doesn't help you any [18:47:58] they are not set up as app servers [18:48:08] cscott: going live... [18:48:15] they have mediawiki on them but they are closer to being like maintenance servrs [18:48:32] Maybe we should add in a php7-flavoured debug server (mwdebug2003php7omg or whatever). 
[18:48:43] well, the curl command you'd use to see whether you're running preprocessor_DOM or preprocessor_hash is in the last comment on https://gerrit.wikimedia.org/r/460202 if you wanted to play around on your own [18:49:01] it should get 0 traffic right now if we do that... I think testing in beta is the better way to go for now [18:49:03] James_F: yeah, that's really what i was wondering if we had [18:49:20] Not yet. [18:49:35] bblack: I updated https://phabricator.wikimedia.org/T201039#4643336 we can keep the same task to T/S the lvs/phab issue, would later today be a good time to do that switch port bounce? [18:49:48] stephanebisson: let me know when that's done and I'll perform one more test w/o x-mediawiki-debug [18:49:57] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460202|]] (duration: 00m 59s) [18:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:31] cscott: all done [18:51:02] still getting = in the output aka Preprocessor_Hash as we expect. all good. [18:55:44] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [18:55:46] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206254 [18:56:02] XioNoX: not sure. checking cal [18:56:43] XioNoX: https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_October_04 [18:57:02] so basically, the next opening is ~2h window from 21-23 UTC [18:57:18] ok! [18:57:21] assuming the stuff before finishes on time, we can declare a short "ok phab things might bounce" and bounce the port [18:59:01] (03CR) 10BBlack: [C: 031] "LGTM?" [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) (owner: 10Ottomata) [19:00:04] marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1900). 
[19:03:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:08:03] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:14:12] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@6dc89c0]: Bump cirrusSearchLinksUpdate concurrency to 50 [19:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:05] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@6dc89c0]: Bump cirrusSearchLinksUpdate concurrency to 50 (duration: 00m 53s) [19:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:05] train is rolling [19:22:35] (03PS1) 10Dduvall: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 [19:22:37] (03CR) 10Dduvall: [C: 032] all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:24:12] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:24:57] (03CR) 10Ayounsi: [C: 032] Add fake ssh keys for netbox user [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:25:00] (03CR) 10Ayounsi: [V: 032 C: 032] Add fake ssh keys for netbox user [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:25:10] (03CR) 10Krinkle: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:25:17] (03CR) 10Krinkle: [C: 04-1] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:26:16] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.24 [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:22] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:30:00] seeing quite a rise in fatals [19:30:37] !log rise in fatals "Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.24/includes/Title.php" [19:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:12] possibly something temporary like bytecode cache warmup [19:33:10] (03CR) 10Dzahn: "in the past i have ran into puppet issues when attempting to create a new user and add it to groups in the same patch. 
i think it's a race" [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) (owner: 10Herron) [19:33:22] marxarelli: yeah, those are known, sadly [19:33:36] no, this is too long i think [19:33:39] it hasn't subsided [19:33:51] rolling back [19:33:52] hmmm [19:34:25] (03PS1) 10Gehel: wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) [19:34:49] marxarelli: https://phabricator.wikimedia.org/T204871 [19:34:51] entire fatalmonitor is filled with those "web request took longer than 60 seconds and timed out" [19:35:27] (03CR) 10Ayounsi: [C: 032] "SSH key added to the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:35:49] i see [19:35:51] (03PS3) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) [19:36:07] so the 60 request timeout is new? [19:36:26] or newly fixed? [19:36:38] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10herron) While the optimal approach would be to dedicate new hardware to Kafka in both eqiad and codfw, after a few conversations within the infra team it sounds like a re... [19:36:46] marxarelli: newly fixed [19:36:51] i'll wait on it then [19:37:12] seems to have gone back down [19:37:23] (03CR) 10Mathew.onipe: [C: 031] wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) (owner: 10Gehel) [19:37:23] sort of [19:37:35] (03CR) 10Gehel: [C: 032] wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) (owner: 10Gehel) [19:37:46] (03PS2) 10Gehel: wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) [19:39:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:41:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:44:28] 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) [19:47:51] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) a:03Cmjohnson This is m5 master, let's get the disk replaced soon Thanks! [19:48:42] Request from 88.97.96.89 via cp2002 cp2002, Varnish XID 131239612 [19:48:42] Error: 429, Too Many Requests at Thu, 04 Oct 2018 19:48:24 GMT [19:48:51] https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf/page156-716px-The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf.jpg [19:49:00] Busy server? 
[19:49:07] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) p:05Triage>03High [19:49:31] tooo many requests [19:49:52] The main issue seems to be images not loading at Wikisource [19:50:12] which is a tiresome as it uses image scans for proofreading [19:50:13] ShakespeareFan00, how big is this problem? can you load anything from upload.wm.o at all? [19:50:50] https://upload.wikimedia.org/wikipedia/commons/8/87/Donna_Strickland%2C_OSA_Holiday_Party_2012.jpg [19:50:53] Loaded [19:50:59] i can load images from Wikisource but on the link ShakespeareFan00 gave i can reproduce. [19:51:31] (03CR) 10Dzahn: [C: 031] "lgtm, compiler run looks positive, also that it is already "beta-picked" speaks for it. (if that means the latest PS is picked)" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:51:39] so that one particular image [19:51:47] Or that DJVU [19:51:50] any other URLs ShakespeareFan00? [19:52:20] https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf/page14-716px-The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf.jpg [19:52:34] Seem to that PDF specfically [19:52:53] well [19:52:55] open a task [19:53:48] stick #thumbor and #media-storage and #traffic on there [19:53:52] sigh [19:56:14] !log mforns@deploy1001 Started deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 [19:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:00] (03CR) 10Gehel: [C: 04-1] "See comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [20:02:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:03:44] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:06:16] (03PS1) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [20:10:18] !log mforns@deploy1001 Finished deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 (duration: 14m 04s) [20:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:10:24] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [20:10:33] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:20:04] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.003 second response time [20:21:14] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.060 second response time [20:21:51] (03CR) 10Cwhite: [C: 031] icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:22:59] !log mforns@deploy1001 Started deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 [20:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:10] (03CR) 10Cwhite: [C: 031] icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [20:23:16] !log mforns@deploy1001 Finished deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 (duration: 00m 17s) [20:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:19] (03CR) 10Gehel: [C: 04-1] "This is actually more complex than it looks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:31:33] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) https://netbox.wikimedia.org/api/dcim/devices/1945/napalm/ > NAPALM is not installed. Please see the documentation for instructions. While ```lang=bash netmon1002:/srv/deployment/netbo... [20:32:26] (03CR) 10Dzahn: [C: 04-1] "will now wait until after eqiad switch back and mwmaint1002 is confirmed working" [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:42:04] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [20:43:04] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [20:45:03] (03CR) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:49:37] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) I sent an email to Ellie to follow-up. 
[20:49:49] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) a:03Dzahn [20:53:21] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) a:05herron>03Dzahn [20:56:03] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) p:05Triage>03Normal [21:05:44] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 67, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:06:02] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) Alright. Done! I added this as an example on wikitech: https://wikitech.wikimedia.org/wiki/Mailman#Real_world_exa... [21:08:25] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) 05Open>03Resolved added this to list description: //This list has been disabled in favor of wikitech-l in http... [21:14:23] (03CR) 10Gergő Tisza: [C: 031] Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [21:14:44] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10Dzahn) @Psychoslave Hi, is this ticket resolved or do you have further questions on the new list? Looks to me you are admin on https://lists.wikimedia.org/mail... [21:15:09] (03CR) 10Gergő Tisza: [C: 031] Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [21:16:24] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 108 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:19:50] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) I see.. hmm. yea, then we should buy a replacement disk. [21:20:03] XioNoX: ready to try the phab1001 port? 
[21:20:20] bblack: almost done with the DR call [21:20:36] bblack: link came up 10min into the meeting [21:21:25] (03PS7) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [21:21:37] ok [21:22:31] (03PS8) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [21:22:33] bblack: tldr for the equinix link, they connected it to the wrong port [21:23:04] still have to figure out who owns the part of the circuit that goes into UL space, [21:23:34] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 [21:23:36] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 [21:23:41] bblack: alright, let's look at phab1001 [21:24:58] (03CR) 10Cwhite: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:25:01] Krenair: ShakespearFan00: re: 429s on thumbs, there's oustanding tickets, it's a relatively-known issue: T175512 T151202 T203135 [21:25:02] T175512: thumbor 429 throttled error messages are confusing - https://phabricator.wikimedia.org/T175512 [21:25:03] T203135: ThumbnailRender job fails with 429 errors - https://phabricator.wikimedia.org/T203135 [21:25:03] T151202: 429 Error generating thumbnail - https://phabricator.wikimedia.org/T151202 [21:25:15] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) For what it's worth, the aliases in eqiad1 can be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464722/ or something like it. [21:26:05] all: there will be a very brief blip of traffic to all phabricator services! [21:26:36] XioNoX: log the port action here I guess and give it a swing! [21:26:48] bblack, ok, thanks [21:27:46] !log bounce phab1001 switch port - T201039 [21:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:51] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [21:28:33] done [21:29:00] [Thu Oct 4 21:28:27 2018] tg3 0000:02:00.0 eth0: Link is down [21:29:00] [Thu Oct 4 21:28:53 2018] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex [21:29:11] hum, still no ND [21:29:13] still no ping6 to lvs1002 ephemeral v6 [21:29:17] from lvs [21:30:48] works the other way around though [21:30:59] (ndisc6 for lvs1002's ipv6 works from phab1001) [21:31:05] yeah [21:32:09] so this means basically, probably, that multicast/broadcast v6 stuff isn't making it into phab1001's port, or isn't making it into lvs1002's port, I think? [21:32:32] I wonder if we can try some other fake multicast traffic to see if either of those are true, and sniff [21:32:57] sorry the earlier line should've said: isn't making it into phab1001's port, or isn't making it out of lvs1002's ports [21:33:19] it's the multicast that 1002 sends to solicit phab1001 that gets no answer [21:33:34] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10eyoung) As Daniel offered above, I need the passwords . Can you reset it for me and I'll deal with it then, getting Irene on as new administrator as well. 
Thanks All! https:... [21:33:35] yeah, never makes it to phab1001 [21:34:04] I wonder if the same flow is blocked for some other kind of multicast v6 [21:34:13] (and to/from other hosts than that pair) [21:34:19] it's a very odd problem :P [21:35:31] we could install ndisc6 on all hosts and script some cumin, but that looks overkill [21:36:04] I have some notes/commands from a previous meeting with JTAC, will try to re-understand what they mean [21:36:31] basically to track a route to a specific mac inside the fabric [21:37:13] I ran through a global "ip -6 neigh show|grep FAIL" earlier FWIW [21:37:35] 10% of all hosts have at least some FAILED entries, and most aren't related to asw2-b [21:38:00] so I don't think it's necessarily uncommon, and it could be we have some general edginess in our ipv6 setup that whatever this is, is just making worse [21:45:07] 10Operations, 10Wikimedia-Mailing-lists, 10User-Urbanecm: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10Dzahn) 05Open>03Resolved a:03Dzahn I logged in on the admin interface using the master password from pwstore. From there i followed the link to the archive... [21:54:56] (03CR) 10Alex Monk: [C: 031] labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 (owner: 10Andrew Bogott) [21:55:24] (03CR) 10Alex Monk: [C: 031] labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 (owner: 10Andrew Bogott) [21:55:45] bblack: I ran the same troubleshooting commands that I had from JTAC, but don't see the same symptoms there [21:56:05] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) The request was to also remove archives. This changed things and had to use "rmlist -a" instead. ``` root@fermium... [21:56:20] next step is to open a jtac ticket I guess [22:11:05] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) ``` [fermium:~] $ list_name="wikimania-com" ; sudo /var/lib/mailman/bin/change_pw -l $list_name -p $(pwgen -c1 -s 12) New wikimania-com password: [fermium:~]... [22:18:49] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) 05Open>03Resolved Tentatively calling it resolved. If anything is not working as expected please let us know and we will reopen this right away. [22:43:53] RECOVERY - ElasticSearch shard size check on search.svc.codfw.wmnet is OK: OK - All good! [22:52:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:53:58] 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Dzahn) I'm afraid we have exhausted our options here. It's a thing the list admins need to agree and change if they want to. They have been pinged... 
[22:56:22] (03PS1) 10Nuria: Rotate logs in refinery based on time rather than size [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) [22:56:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:56:53] 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Dzahn) The right venue for this issue is still emailing the list owners at: **wikimedia-l-owner@lists.wikimedia.org** Or if that fails, wikimedia-l... [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:03:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen