[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T0000). [00:03:53] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.4 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:10:57] SMalyshev: a little while ago the number of threads on wdqs just kept going up a lot [00:11:08] so no, i dont think network problems [00:11:48] queries per second did not go up though [00:12:23] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.22 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:14:22] SMalyshev: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&from=1538603824853&to=1538611796920&panelId=22&fullscreen&var-cluster_name=wdqs&refresh=1m [00:27:47] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Klein) It worked! Thank you all! :) [00:41:39] (03PS4) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:42:17] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:50:00] (03PS5) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:59:20] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (10faidon) Note that we bought OpenGear adapters for the all the Atlases across all sites (incl. ulsfo) last year and shipped them to eqiad: T166715#3308801 [00:59:56] (03PS6) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:02:03] (03CR) 10Dzahn: "converted upload_rewrite to a struct." 
[puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:16:03] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.003 second response time [01:24:15] (03PS7) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:24:17] (03PS4) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:25:12] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:25:25] (03CR) 10Dzahn: "PS4: upload_rewrite is now a struct instead of a string" [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:45] (03PS8) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:47] (03PS5) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:27:49] (03PS3) 10Dzahn: mediawiki::web::prod_sites: convert donate.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:32:54] (03PS9) 10Dzahn: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:36:13] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:37:49] (03PS6) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:38:34] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:40:43] (03PS7) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:41:23] PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:41:26] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:47:40] (03PS8) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [01:49:22] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:54:02] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [01:58:43] 
PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time [02:07:52] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.001 second response time [02:09:42] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [02:10:43] RECOVERY - WDQS HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.366 second response time [02:10:53] PROBLEM - High lag on wdqs2001 is CRITICAL: 1.076e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:12:12] PROBLEM - WDQS HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.533 second response time [02:13:13] RECOVERY - WDQS HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.097 second response time [02:13:43] PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [02:14:12] PROBLEM - High lag on wdqs2003 is CRITICAL: 1.096e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:14:42] PROBLEM - High lag on wdqs2002 is CRITICAL: 1.098e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:14:52] RECOVERY - WDQS HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.169 second response time [02:14:56] RECOVERY - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.093 second response time [02:43:04] !log depooled wdqs2001 to see if it catches up faster [02:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:10:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:16:14] (03PS8) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:16:47] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [03:17:43] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 289 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:21:05] !log repooled wdqs2001 [03:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:43] !log depool wdqs2003 to let it catch up [03:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:11] (03PS9) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:30:09] (03PS10) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [03:34:11] (03CR) 10Mathew.onipe: "Jenkins dry run:" [puppet] - 
10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [03:35:23] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 780 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:01:52] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 694 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:02:02] PROBLEM - pdfrender on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:03:40] (03CR) 10Smalyshev: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [04:04:39] !log repooled wdqs2003 [04:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:22] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time [04:21:20] (03PS11) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [05:25:33] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) [05:27:31] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:29:13] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:30:24] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2062 (duration: 00m 57s) [05:30:25] !log Deploy schema change on db2062 - T205913 [05:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:36] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [05:30:54] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 [05:32:41] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:33:29] (03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464479 (https://phabricator.wikimedia.org/T205913) (owner: 10Marostegui) [05:34:21] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:35:27] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2062 (duration: 00m 56s) [05:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:20] !log Deploy schema change on db2048 (s1 master) - T205913 [05:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:24] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 [05:37:33] (03CR) 10Alexandros Kosiaris: [C: 031] sre.switchdc.mediawiki: remove the restart parsoid step, now useless [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [05:48:21] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2062" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/464480 (owner: 10Marostegui) [05:53:18] (03CR) 10Giuseppe Lavagetto: [C: 032] sre.switchdc.mediawiki: remove the restart parsoid step, now useless [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [06:04:39] (03PS2) 10Jcrespo: mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) [06:12:14] (03CR) 10Jcrespo: [C: 031] "Compression finished:" [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:13:03] (03CR) 10Marostegui: [C: 031] mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:17:12] (03CR) 10Jcrespo: [C: 032] mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [06:19:11] (03PS4) 10Giuseppe Lavagetto: service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 [06:21:45] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Aklapper) 05stalled>03Open [06:22:25] (03CR) 10Krinkle: [C: 031] Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [06:24:53] !log create manual backup of databases on eqiad s6, s7, s8, x1 [06:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:13] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:25:54] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05Open>03stalled p:05Triage>03Normal Stalled as the server hasn't been received yet [06:26:17] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:27:58] (03PS1) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:29:13] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:29:22] (03PS2) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:29:32] (03PS3) 10Zoranzoki21: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) [06:31:52] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10Marostegui) The following hosts (aside from the ones above) will need to be downtimed too: db1117, db2042 and db2078 (they replicate from db1072 and db1073) db2037 (replicates from d... 
[06:33:37] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 (owner: 10Giuseppe Lavagetto) [06:34:53] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:38:02] (03PS1) 10KartikMistry: [WIP] apertium-apy: Set locale to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/464482 [06:38:30] (03PS2) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) [06:40:57] (03CR) 10Giuseppe Lavagetto: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [06:41:09] (03PS1) 10Alexandros Kosiaris: mathoid: Add nomial resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [06:41:52] (03PS2) 10Alexandros Kosiaris: mathoid: Add nominal resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [06:44:46] (03PS2) 10Giuseppe Lavagetto: parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 [06:51:07] !log reenabling consistency configuration on s5 replica databases [06:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:07] (03PS4) 10Zoranzoki21: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) [06:53:23] (03PS5) 10Zoranzoki21: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) [06:57:46] !log starting multisource replication of s3 from s5 at eqiad master [06:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:24] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:10:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:14:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:23:13] PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:23:13] PROBLEM - swift-account-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:23:14] PROBLEM - swift-object-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:23:23] PROBLEM - swift-container-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:23:24] PROBLEM - swift-container-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:23:24] PROBLEM - swift-container-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:23:43] PROBLEM - swift-account-reaper on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:23:44] PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:23:50] that's me [07:23:50] <_joe_> is someone working on ms-be1041? [07:23:53] <_joe_> ok [07:23:53] PROBLEM - swift-container-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:23:54] <_joe_> :P [07:23:54] PROBLEM - swift-account-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:24:03] PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:24:04] dammit I thought I silenced it [07:24:04] PROBLEM - swift-account-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:24:07] sorry about the spam [07:25:40] !log reformat ms-be1041 with crc=1 finobt=0 - T199198 [07:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:45] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:28:43] RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:28:43] RECOVERY - swift-account-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:28:44] RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:28:54] RECOVERY - swift-container-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:29:03] RECOVERY - swift-container-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:29:03] RECOVERY - swift-container-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:29:14] RECOVERY - swift-account-reaper on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-reaper [07:29:14] RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:29:23] RECOVERY - swift-container-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:29:24] RECOVERY - swift-account-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:29:25] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga: reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts - https://phabricator.wikimedia.org/T206187 (10Mathew.onipe) [07:29:34] RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:29:43] RECOVERY - swift-account-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:31:17] !log move Piwik/Matomo from bohrium to matomo1001 - T202962 [07:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:23] T202962: Upgrade bohrium (piwik/matomo) to Debian Stretch - https://phabricator.wikimedia.org/T202962 [07:35:33] (03PS3) 10Elukey: role::cache::text: add a backend for matomo1001 [puppet] - 10https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962) [07:36:14] (03CR) 10Elukey: [C: 032] role::cache::text: add a backend for matomo1001 [puppet] - 10https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [07:40:50] (03PS1) 10Zoranzoki21: Edited syntax of the code where the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:40:52] (03CR) 10Mathew.onipe: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [07:41:46] (03PS2) 10Zoranzoki21: Edited syntax of the code where the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:42:06] (03PS3) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [07:43:21] (03PS3) 10Elukey: Replace bohrium with matomo1001 in cache text configuration [puppet] - 10https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962) [07:44:07] (03CR) 10Mathew.onipe: "> Patch Set 11: Verified+2" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [07:44:27] (03CR) 10Muehlenhoff: [C: 04-1] "Some comments inline" (033 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [07:49:42] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Marostegui) [07:52:15] (03CR) 10KartikMistry: "kartik@scb2001:~$ locale -a | grep UTF-8" [puppet] - 10https://gerrit.wikimedia.org/r/464482 (owner: 10KartikMistry) [07:57:19] (03CR) 10Elukey: [C: 032] Replace bohrium with matomo1001 in cache text configuration [puppet] - 10https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [08:00:37] !log re-enabling puppet on maps1004 [08:00:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:33] (03CR) 10Jcrespo: "> but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:09:16] !log Restart icinga T196336 [08:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:21] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [08:12:32] (03CR) 10Banyek: "> > but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:13:32] (03CR) 10Marostegui: "> > but we still need to support jessie" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:13:59] (03CR) 10Muehlenhoff: [C: 04-1] "Perfect, then we can use dh_sysuser instead, which reduces this to a very small change, see https://manpages.debian.org/stretch/dh-sysuser" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [08:14:01] ACKNOWLEDGEMENT - ElasticSearch shard size check on search.svc.codfw.wmnet is CRITICAL: CRITICAL - cebwiki_content_1521724408(51gb) Mathew.onipe This is mostly caused by segment merges - T206187 - The acknowledgement expires at: 2018-10-05 20:10:28. [08:15:38] ACKNOWLEDGEMENT - ElasticSearch shard size check on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1537913194(61gb) Mathew.onipe This is mostly caused by segment merges - T206187 - The acknowledgement expires at: 2018-10-05 20:10:12. [08:20:55] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) Apparently @Mattflaschen-WMF is no more in charge, who is in charge of flow maintenance now, maybe #gro... [08:34:00] !log installing ca-certificates updates for jessie/stretch [08:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:09] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ca-certificates] [08:45:13] (03PS1) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:46:42] (03PS2) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:49:46] (03PS2) 10Volans: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) [08:52:18] !log installing python2.7/python3.4/python3.5 security updates on jessie/stretch [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] (03CR) 10Volans: [C: 031] "LGTM, don't forget to commit the real one in the private repo before merging the puppet change" [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [08:52:56] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:54:22] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:54:28] (03PS3) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [08:55:42] (03CR) 10Volans: [C: 04-1] "I think we need another notify too, see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [08:58:03] (03CR) 10Gehel: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [08:59:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12763/" [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [08:59:40] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only makes sense in active control node [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) [09:00:12] (03PS2) 10Elukey: Clean up bohrium's references in cache text [puppet] - 10https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962) [09:03:00] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Volans) As a reminder be careful when adding and merging fleet-wide checks, I'm not sure how many more we can add without increasing too much Icinga load as 1 fleet wide check => 1300 ch... 
[09:05:10] (03CR) 10Volans: "FYI I've updated the Switch Datacenter wiki page that was left behind after this change ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/464162 (owner: 10Giuseppe Lavagetto) [09:05:52] (03CR) 10Elukey: [C: 032] Clean up bohrium's references in cache text [puppet] - 10https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [09:06:29] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:07:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:07:51] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only makes sense in active control node [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) [09:12:06] (03CR) 10Gehel: [C: 04-1] "see comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [09:13:21] !log T203177 schedule 8h icinga downtime for cloudcontrol1003,1004 and labmon1001 [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:26] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [09:14:23] (03CR) 10Filippo Giunchedi: "FWIW the collected metrics are not going to be duplicated in the sense that they would have different "instance" tags for each cloudcontro" [puppet] - 10https://gerrit.wikimedia.org/r/464493 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:17:47] arturo: FYI ^ [09:18:57] godog: something from wikibugs? I ignore it [09:19:10] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 43.56 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:20:06] oh a gerrit comment, reading [09:21:12] (03PS1) 10Muehlenhoff: Add library hints for Pythons [puppet] - 10https://gerrit.wikimedia.org/r/464494 [09:21:31] arturo: yeah that one [09:22:29] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:22:30] godog: what benefit do you see in storing both metrics? 
[09:24:00] (03CR) 10Muehlenhoff: [C: 032] Add library hints for Pythons [puppet] - 10https://gerrit.wikimedia.org/r/464494 (owner: 10Muehlenhoff) [09:24:05] making sure metrics collection works all the time, regardless of active/passive mostly [09:24:27] so then when you switch you have to think about one less thing and actually see the switch happening [09:25:13] makes sense [09:26:08] on the other hand, the passive node is not even that, is just a cold spare which is not expected to go into service unless we have serious issues [09:27:10] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: metrics: cleanup unused hiera datafile [puppet] - 10https://gerrit.wikimedia.org/r/464495 (https://phabricator.wikimedia.org/T203177) [09:28:38] yeah, my point being that when you actually put it in service you already know you have metrics [09:30:07] godog: fair enough, I will revert and think on doing some filters in the grafana :-) thanks [09:30:48] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/464495 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:32:31] arturo: sounds good! you can either add an "instance" templating variable or depending on the metrics you can query for values > 0 and that should do the right thing [09:32:49] or display all instances with {{instance}} in the legend template [09:33:36] godog: ok, will investigate and ask for your help in don't manage to do it myself :-P [09:35:03] (03PS1) 10Arturo Borrero Gonzalez: Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) [09:35:20] godog: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464496 +1 welcome :-) [09:35:40] (03PS5) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [09:37:11] (03CR) 10Filippo Giunchedi: [C: 031] Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:37:13] yup, lgtm! [09:37:20] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:37:30] (03PS2) 10Arturo Borrero Gonzalez: Revert "prometheus-openstack-exporter: only makes sense in active control node" [puppet] - 10https://gerrit.wikimedia.org/r/464496 (https://phabricator.wikimedia.org/T203177) [09:37:38] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) >>! In T206114#4641114, @Volans wrote: > As a reminder be careful when adding and merging fleet-wide checks, I'm not sure how many more we can add without increasing too much Icin... [09:41:01] (03CR) 10Volans: [C: 04-1] "Generally looks ok, small nits around and a couple of questions. 
Also I've just skimmed the tests as I'm totally not familiar with them an" (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [09:43:13] (03CR) 10Volans: Netbox, set the napalm_username variable and matching keyholder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [09:46:58] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: metrics: adjust depedency on novaenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [09:47:47] (03CR) 10Jcrespo: "Genuine question I don't know, see below" (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [09:47:49] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: metrics: adjust depedency on novaenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [09:48:58] (03CR) 10Jcrespo: "Another question, sorry for my ignorance." (031 comment) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [09:52:38] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam) @DarTar could you please check this, and if ok, approve it? Thanks! [09:52:39] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) [09:53:18] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:57:26] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) [10:02:04] (03PS13) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:02:30] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6294.69 seconds [10:04:14] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:05:38] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:06:31] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [10:06:34] (03PS1) 10Elukey: Move _etcd._tcp* SRV records to etcd codfw [dns] - 10https://gerrit.wikimedia.org/r/464503 (https://phabricator.wikimedia.org/T205814) [10:08:14] !log rolling reboot of ms-fe hosts in eqiad for kernel security update [10:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:13] (03PS1) 10Alexandros Kosiaris: mathoid: Switch liveness probe into tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464504 [10:10:42] (03PS1) 10Alexandros Kosiaris: Set the scaffolding's livenessProbe to tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464505 [10:11:52] the dbstore1001 lag may be the backups [10:12:01] I will disable those alerts and 
setup a comment [10:12:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] role::configcluster_stretch: enable etcd replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:12:51] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) [10:13:42] (03CR) 10Elukey: role::configcluster_stretch: enable etcd replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:14:00] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: metrics: we don't require observerenv [puppet] - 10https://gerrit.wikimedia.org/r/464500 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [10:16:17] (03PS4) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:18:29] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:18:29] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 46.88 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:20:44] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:25:52] (03CR) 10Vgutierrez: Detect when cert config changes and re-issue (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:25:57] (03PS14) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [10:26:56] (03CR) 10Filippo Giunchedi: [C: 031] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:28:29] (03CR) 10Filippo Giunchedi: [C: 031] "Nit inline, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:28:42] (03CR) 10Banyek: wmf-pt-kill: WMF patched version 2 (032 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:29:37] (03PS1) 10Giuseppe Lavagetto: Repackaging for stretch [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/464507 (https://phabricator.wikimedia.org/T205814) [10:30:40] (03CR) 10jerkins-bot: [V: 04-1] Repackaging for stretch [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/464507 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [10:31:29] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, added a few WMCS folks for confirmation" [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:32:54] (03PS5) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:33:45] (03CR) 10Giuseppe Lavagetto: [C: 031] "+1 but wait for the etcd-mirror package to be available." 
[puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [10:33:48] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, and 2 others: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10mmodell) [10:35:48] (03CR) 10Muehlenhoff: [C: 04-1] wmf-pt-kill: WMF patched version 2 (032 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:36:32] <_joe_> !log uploading etcd-mirror to stretch-wikimedia T205814 [10:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:37] T205814: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 [10:36:58] (03PS6) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [10:37:59] (03CR) 10Alex Monk: [C: 04-1] "also see PS36 comment" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [10:38:08] !log upload scap 3.8.7-1 - T204383 [10:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:12] T204383: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 [10:38:56] (03PS4) 10Filippo Giunchedi: Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (https://phabricator.wikimedia.org/T204383) (owner: 1020after4) [10:39:04] (03PS6) 10Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 [10:39:39] (03CR) 10Filippo Giunchedi: [C: 032] Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (https://phabricator.wikimedia.org/T204383) (owner: 1020after4) [10:41:59] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.92 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:43:22] I've silenced that alert for ulsfo, depooled [10:44:17] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.23/README: noop sync to verify that scap 3.8.7-1 works (at least on a basic level) (duration: 00m 59s) [10:44:18] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10fgiunchedi) [10:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:21] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, and 2 others: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi All done! 3.8.7-1 is live [10:44:56] Thanks for uploading the new version godog! [10:45:07] np twentyafterfour [10:47:33] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/463931 (owner: 10Banyek) [10:53:05] (03PS1) 10Sbisson: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:43] Present [11:02:47] zeljkof, will you swat? [11:03:54] o/ [11:03:58] I can SWAT today [11:04:02] Urbanecm: yes! :D [11:04:30] :D [11:05:51] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:07:43] (03Merged) 10jenkins-bot: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:08:08] (03CR) 10jenkins-bot: Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463582 (https://phabricator.wikimedia.org/T205747) (owner: 10Zoranzoki21) [11:08:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:08:53] Urbanecm: 463582 is at mwdebug2001 [11:10:04] zeljkof, are you sure it is on mwdebug2001? [11:10:19] Ah, sorry [11:10:40] Yeah, it is working, was checking in wrong way [11:10:43] zeljkof ^^ [11:10:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:11:48] (03PS39) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [11:16:18] Urbanecm: sorry, got distracted, deploying [11:16:31] ok [11:17:31] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463582|Create Photowalk and Photowalk Talk namespaces for bd.wikimedia.org (T205747)]] (duration: 00m 57s) [11:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:36] T205747: Create new Photowalk namespace for bd.wikimedia.org - https://phabricator.wikimedia.org/T205747 [11:17:40] Urbanecm: deployed ^ [11:17:56] thank you. Can you run namespaceDupes.php to be sure there's nothing inaccessible? 
[11:18:32] zeljkof, ^ [11:18:33] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:18:41] Urbanecm: sure [11:18:45] thank you [11:22:15] Urbanecm: done T205747#4641406 [11:22:28] thx [11:22:48] (03CR) 10Zfilipin: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:22:55] (03PS6) 10Zfilipin: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:23:12] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:25:17] (03Merged) 10jenkins-bot: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:25:50] Urbanecm: 463584 is at mwdebug [11:25:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:26:04] testing [11:26:28] (03PS2) 10Zfilipin: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:26:37] zeljkof, working, please deploy [11:26:44] ok [11:27:38] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463584|Change acewiki default time zone to Asia/Jakarta (T205693)]] (duration: 00m 56s) [11:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:43] T205693: Change acewiki default time zone to Asia/Jakarta - https://phabricator.wikimedia.org/T205693 [11:27:50] Urbanecm: deployed ^ [11:27:55] thx [11:28:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:30:19] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:30:35] (03Merged) 10jenkins-bot: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:32:20] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:32:38] Urbanecm: 463227 at mwdebug2001 [11:33:39] zeljkof, working, please deploy (and run namespaceDupes.php afterwards).- [11:34:43] ok [11:35:49] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463227|Add some namespaces aliases for zhwikiversity (T201675)]] (duration: 00m 57s) [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:54] T201675: Create new namespaces in zhwikiversity - https://phabricator.wikimedia.org/T201675 [11:35:58] Urbanecm: deployed ^ [11:36:35] thank you. 
[11:38:03] hmm, 3 links to fix, 2 were resolvable. [11:38:09] Noting, will investigate later [11:38:19] Urbanecm: yeah, one problem [11:38:35] (03PS2) 10Zfilipin: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:38:50] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:39:05] (03CR) 10jenkins-bot: Change acewiki default time zone to Asia/Jakarta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463584 (https://phabricator.wikimedia.org/T205693) (owner: 10Zoranzoki21) [11:39:07] (03CR) 10jenkins-bot: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [11:39:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:40:25] zeljkof, please push wgCopyUploadsDomains patches directly to prod, nothing to test for me. Thank you! [11:40:41] (03Merged) 10jenkins-bot: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:40:42] Urbanecm: ok [11:41:46] Urbanecm: merge conflict for 457474 [11:41:57] (not resolvable in gerrit) [11:42:01] will fix zeljkof [11:42:13] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457469|Add .bollywoodhungama.in to wgCopyUploadsDomains (T203363)]] (duration: 00m 57s) [11:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] T203363: Please add http://www.bollywoodhungama.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T203363 [11:42:56] Urbanecm: 457469 deployed ^ [11:43:02] thx [11:45:31] Urbanecm: merge conflict also for 460700 [11:45:38] fixing both [11:47:16] (03PS3) 10Urbanecm: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) [11:47:20] ^^ zeljkof ^^ [11:47:33] Urbanecm: on it [11:47:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:48:02] (03PS2) 10Urbanecm: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) [11:48:07] and the second one ^^ [11:48:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:50:07] (03Merged) 10jenkins-bot: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:50:35] Urbanecm: the last two patches should be deployed without mwdebug? 
[11:50:39] yes [11:51:36] Urbanecm: still conflict for 460700 [11:51:41] (the last one) [11:51:53] ok, probably the previous patch caused another conflict, fixing [11:52:02] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457474|add Radlines.org to $wgCopyUploadsDomains (T203219)]] (duration: 00m 57s) [11:52:05] yes, the last 3 patches seem to touch the same line [11:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:07] T203219: Please add Radlines.org to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T203219 [11:52:23] Urbanecm: 457474 deployed ^ [11:52:40] (03PS3) 10Urbanecm: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) [11:52:42] fixed ^^ [11:53:43] (03CR) 10jenkins-bot: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) (owner: 10Urbanecm) [11:53:45] (03CR) 10jenkins-bot: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) (owner: 10Urbanecm) [11:53:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [11:55:56] (03Merged) 10jenkins-bot: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [11:57:20] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460700|Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons (T203371)]] (duration: 00m 56s) [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:26] T203371: Please add nasimonline.ir to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T203371 [11:57:29] Urbanecm: all deployed! [11:57:33] thank you! [11:57:40] !log EU SWAT finished [11:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:25] well, do you have time for one additional patch zeljkof? 
:D MW train is not deployed in its EU version, so I hope it would be possible :) [11:58:31] it's https://gerrit.wikimedia.org/r/464481 [11:59:06] (03PS4) 10Zfilipin: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [11:59:15] Urbanecm: sure [11:59:18] thank you [11:59:59] Urbanecm: just please update the calendar [12:00:02] will do, thanks [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1200) [12:00:07] !log one more patch for EU SWAT [12:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:11] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:01:23] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [12:01:52] !log rolling reboot of ms-fe hosts in codfw for kernel security update [12:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:07] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:03:00] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [12:03:09] (03Merged) 10jenkins-bot: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:03:29] are there any differences between the eqiad and codfw mw redis lock servers? [12:04:11] Urbanecm: 464481 at mwdebug [12:04:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:04:43] zeljkof, working, please deploy [12:06:19] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:464481|Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki (T205595)]] (duration: 00m 57s) [12:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:23] T205595: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki - https://phabricator.wikimedia.org/T205595 [12:06:41] Urbanecm: all deployed, please check and thanks for deploying with #releng! ;) [12:06:47] !log EU SWAT finished [12:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] thank you zeljkof for deploying my and Zoranzoki21's patches!
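For readers unfamiliar with the SWAT jargon above: once jenkins-bot merges a mediawiki-config change, it is pulled onto the deployment host, staged on an mwdebug host for the requester to verify, and then synced to the fleet. A rough sketch, assuming the usual scap workflow; hostnames and paths are the conventional ones, not copied from this log:

    # on deploy1001: bring the staging copy up to date with the merged change
    cd /srv/mediawiki-staging && git pull
    # on an mwdebug host: pull the staged code so the requester can test via X-Wikimedia-Debug
    scap pull
    # back on deploy1001, once the requester confirms ("working, please deploy"):
    scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:464481|...]] (T205595)'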
[12:07:57] Urbanecm: no problemo :D [12:09:13] (03CR) 10jenkins-bot: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371) (owner: 10Urbanecm) [12:09:15] (03CR) 10jenkins-bot: Add permission "move-rootuserpages" to usergroup "eliminator" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464481 (https://phabricator.wikimedia.org/T205595) (owner: 10Zoranzoki21) [12:14:49] (03CR) 10GTirloni: [C: 032] openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [12:18:34] (03PS7) 10Elukey: role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) [12:22:49] (03PS1) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [12:23:00] (03CR) 10Elukey: [C: 032] role::configcluster_stretch: enable etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/464492 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:23:56] !log deploy etcdmirror on conf1005 - T205814 [12:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:01] T205814: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 [12:24:40] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:24:48] (03CR) 10Volans: [C: 031] "LGTM, great job Matt! One documentation nitpick inline, but feel free to merge as is." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:26:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:35:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:39:47] (03CR) 10Alexandros Kosiaris: [C: 031] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [12:45:07] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:46:09] (03PS12) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [12:49:45] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:49:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:50:38] (03CR) 10Mathew.onipe: "Jenkins dry run: https://puppet-compiler.wmflabs.org/compiler1002/12770/wdqs1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:52:18] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) My current understanding of the issue: All IRQs from NIC are handled by a single CPU. Under load, Blazegraph satur... [12:52:38] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) a:03Gehel [12:57:22] (03CR) 10Gehel: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:59:35] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1300) [13:02:17] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Scap, and 2 others: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10mobrovac) >>! In T205981#4634221, @Gehel wrote: > I can confirm that @Mathew.onipe needs to be able to deploy wikidata query serv... [13:02:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Add cumin aliases for each wdqs clusters - https://phabricator.wikimedia.org/T205542 (10Mathew.onipe) 05Open>03Resolved [13:03:38] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) [13:03:45] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[chown /etc/eventstreams/config.yaml],Package[electron-render/deploy],Exec[chown /srv/deployment/electron-render for deploy-service] [13:04:27] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [13:07:05] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:08:09] (03CR) 10Gehel: [C: 04-1] "A few minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:08:14] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:09:16] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 9.002 ge 4 Muehlenhoff T200678 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [13:09:30] (03PS4) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [13:11:32] (03PS2) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:11:47] (03PS2) 10Ottomata: Install python 2 variant of sklearn on stat machines [puppet] - 10https://gerrit.wikimedia.org/r/464425 (owner: 10Gilles) [13:12:03] (03CR) 10Ottomata: [V: 032 C: 032] Install python 2 variant of sklearn on stat machines [puppet] - 10https://gerrit.wikimedia.org/r/464425 (owner: 10Gilles) [13:14:05] !log muting alerts on dbstore2002 and resuming compression of s2 database tables (T204930) [13:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:12] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [13:14:26] !log muting alerts on s2replication @dbstore2002 and resuming compression of s2 database tables (T204930) [13:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:48] (03PS3) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:15:08] (03PS4) 10Elukey: Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) [13:19:42] 10Operations, 10Analytics, 10hardware-requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10elukey) p:05Triage>03Normal [13:20:12] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10elukey) [13:21:21] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Ottomata) I think `analytics-admins` is the right group let's keep using it! 
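T206217 above boils down to a sudo policy: members of analytics-admins get permission to restart two systemd services. Once that is in place, day-to-day use would look roughly like the following; the unit names are assumptions, not verified against the actual hosts:

    # restart the UIs after a config change or a hang (unit names assumed)
    sudo systemctl restart turnilo.service
    sudo systemctl restart superset.service
    # confirm they came back cleanly
    systemctl status turnilo.service superset.service --no-pager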
[13:26:20] (03CR) 10Ottomata: [C: 031] Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:29:05] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:32:39] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Ottomata) @HaeB do you still need this? Can we roll this back? [13:38:18] (03CR) 10Elukey: [C: 032] Apply a limited analytics coordinator role to an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/464523 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:40:10] hey all, I'm deploying the Proton on production to the https://gerrit.wikimedia.org/r/#/c/mediawiki/services/chromium-render/deploy/+/464558/ [13:40:24] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10akosiaris) 05Resolved>03Open I followed http://erikimh.com/megacli-cheatsheet/ to do so and ``` megacli -PdReplaceMissing -PhysDrv [15:9] -Array0 -row9 -a0... [13:41:25] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[python-sklearn] [13:41:45] fixing --^ [13:42:16] (03PS1) 10Ottomata: Add Accept header to varnishkafka webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) [13:43:39] !log pmiazga@deploy1001 Started deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:46] T201158: [4hrs] Have a Grafana dashboard for Proton - https://phabricator.wikimedia.org/T201158 [13:43:46] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [13:44:38] (03CR) 10Bstorm: "Going to get the initial actor table patch deployed fully before I merge this. Also, I'll test it locally." [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [13:46:34] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:46:35] !log pmiazga@deploy1001 Finished deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) (duration: 02m 55s) [13:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:21] PROBLEM - Etcd replication lag on conf1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.002 second response time [13:48:45] <_joe_> uhm [13:48:51] <_joe_> elukey: didn't you disable notifications? [13:48:55] what's up? 
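The proton deploy logged at 13:43 above is a standard scap3 service deploy. A sketch of what that usually involves on the deployment host; the repo path follows the common /srv/deployment convention and is an assumption here:

    # on deploy1001, from the service's deploy repository
    cd /srv/deployment/proton/deploy
    git pull && git submodule update --init --recursive
    # scap deploy pushes the new revision to the configured targets, restarts the
    # service, and writes the Started/Finished lines to the SAL as seen above
    scap deploy 'Bugfix: handle undefined response and fix grafana stats (T186748, T201158)'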
[13:49:02] yeah but for an hour, it must have expired, my bad [13:49:04] <_joe_> anyways, everyone, we're installing that [13:49:08] <_joe_> server [13:49:15] thanks for the heads up [13:49:48] added downtime for 4 hours [13:49:53] ah ok, so no worries :) [13:54:49] (03PS1) 10Elukey: profile::analytics::database::meta: fix require for stretch [puppet] - 10https://gerrit.wikimedia.org/r/464568 (https://phabricator.wikimedia.org/T205509) [13:55:08] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) [13:55:40] (03CR) 10Elukey: [C: 032] profile::analytics::database::meta: fix require for stretch [puppet] - 10https://gerrit.wikimedia.org/r/464568 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [13:59:14] (03PS1) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notification for check_elasticsearch_shard_size [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [14:00:08] (03CR) 10jerkins-bot: [V: 04-1] icinga::monitor::elasticsearch: throttle alerts notification for check_elasticsearch_shard_size [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:01:29] (03CR) 10Mathew.onipe: "@Filippo: the aim is to retry after 6 hours thrice before it finally throws an alert. Please confirm if this CR takes care of this." [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:03:24] (03PS2) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [14:03:30] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10kostajh) @jcrespo yes, #growth-team is handling #structureddiscussions. > Not doing this may soon block T106386... [14:06:38] (03CR) 10Ottomata: "Tested in deployment-prep, works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) (owner: 10Ottomata) [14:09:42] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: fix regex in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/464573 [14:09:50] !log Sanitize enwikivoyage cebwiki shwiki srwiki mgwiktionary on db1124:3315 T184805 [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [14:15:16] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:16:28] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [14:16:29] RECOVERY - Etcd replication lag on conf1005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.002 second response time [14:16:53] eyeroll [14:17:06] (03CR) 10Gehel: "Minor comment inline. I'd like Filippo to go over this to validate this does what I think it does (I'm sometimes confused by Icinga)." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:17:44] (03PS13) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [14:19:32] (03CR) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:29:31] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:29:31] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10greg) >>! In T205981#4641753, @Gehel wrote: > @greg it looks like we need your approval to a... [14:30:05] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) @kostajh I don't have a say on that, was just pointing we are waiting for someone to take a lead, and o... [14:34:19] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10DarTar) Approved, thanks. [14:34:36] (03PS2) 10Marostegui: wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:39:53] (03PS3) 10Alexandros Kosiaris: mathoid: Add nominal resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464483 [14:39:55] (03PS2) 10Alexandros Kosiaris: mathoid: Switch liveness probe into tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464504 [14:39:57] (03PS2) 10Alexandros Kosiaris: Set the scaffolding's livenessProbe to tcpSocket [deployment-charts] - 10https://gerrit.wikimedia.org/r/464505 [14:45:31] 10Operations, 10cloud-services-team: WMCS: Fewer transitory middle-of-the-night puppet alerts - https://phabricator.wikimedia.org/T206224 (10Andrew) [14:47:58] (03PS1) 10Alexandros Kosiaris: scaffold: Add some sample requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/464578 [14:48:00] (03PS1) 10Alexandros Kosiaris: mathoid: Bump num_workers to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/464579 [14:48:02] (03PS1) 10Alexandros Kosiaris: mathoid: Bump chart version to 0.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/464580 [14:48:04] (03PS2) 10Dduvall: Use sed instead of envsubst [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 [14:48:51] !log depooling labsb1010 (T195747) [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:57] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [14:49:23] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1010 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464569 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [14:49:30] (03CR) 10Dduvall: "Thought I'd resurrect this patchset one more time." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [14:49:36] (03CR) 10Alexandros Kosiaris: [C: 032] ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:49:44] (03PS2) 10Alexandros Kosiaris: ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:49:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) (owner: 10Dduvall) [14:50:20] marostegui: merging yours as well [14:50:27] akosiaris: mine? [14:50:38] Marostegui: wiki replicas: depool labsdb1010 to add initial actor table changes to views (cd7d11227b) [14:50:40] ? [14:50:44] good thing I pinged [14:50:47] (03CR) 10Gehel: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [14:50:55] akosiaris: that is banyek [14:51:06] @akosaris shall I merge yours too? [14:51:14] akosaris: shall I merge yours too? [14:51:18] lol [14:51:22] yes [14:51:30] hm [14:51:39] so we list the uploader but not the committer there [14:51:47] akosiaris: I did the last rebase indeed [14:51:49] and on purpose now that I think about it [14:52:07] akosiaris: It is fine, brooke sent the patch, i rebased it and banyek +2 it [14:52:10] XD [14:52:12] yeah I saw [14:52:30] I was just wondering why I saw a different username but ok [14:52:44] I do even remember why we did it that way and not the other way around [14:53:42] (03PS2) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [14:54:00] (03PS2) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [14:54:19] (03PS2) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [14:55:25] (03PS4) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [14:55:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) Let's please update docs: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest [14:58:51] (03PS1) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 [14:59:39] (03CR) 10jerkins-bot: [V: 04-1] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:01:27] (03CR) 10Dzahn: [C: 031] "took the generated config from compiler, copied it to mwdebug1001 to replace wiktionary.conf.
ran apache-fast-test from deploy1001 with th" [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:01:58] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Nuria) +1 to analytics-admins [15:04:54] (03CR) 10Vgutierrez: [C: 04-1] "careful with the regex, lvs20010 != lvs2010 :)" [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:08:25] Hi, FYI, we're going to start the asw2-b-eqiad recabling work in ~1h, see https://phabricator.wikimedia.org/T201039 for the list of hosts impacted (and the email sent to ops@) [15:09:55] ack thanks :) [15:13:08] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) The disk was a spare...i didn't even look to see that it was a SATA disk. This server is out of warranty and we'll need to buy 4TB SAS disks [15:13:49] (03CR) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [15:18:44] (03CR) 10Gehel: [C: 031] "LGTM, I will check with Stas before merging." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [15:19:40] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) Maybe it makes sense to prioritize T196478 instead? [15:20:10] (03CR) 10C. Scott Ananian: "The dependency went out in 1.32.0-wmf.23 last week, and so should be safe to merge today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [15:20:21] (03PS5) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [15:20:50] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:20:59] (03CR) 10Vgutierrez: [C: 032] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:23:22] (03Merged) 10jenkins-bot: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:25:36] (03PS2) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 [15:25:41] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10akosiaris) >>! In T205364#4642331, @Dzahn wrote: > Maybe it makes sense to prioritize T196478 instead? That's what we've been doing up to now more or less. But it doesn't look good either...
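The -1 on the netboot.cfg change above, "careful with the regex, lvs20010 != lvs2010", is the classic unanchored-pattern trap: a pattern meant for a short hostname also matches any longer hostname that merely starts with it. A quick illustration with grep, not the actual netboot.cfg syntax:

    printf 'lvs2007\nlvs2008\nlvs20070\n' | grep -E 'lvs200[78]'
    # matches all three lines, including the unintended lvs20070
    printf 'lvs2007\nlvs2008\nlvs20070\n' | grep -E '^lvs200[78]$'
    # anchoring with ^ and $ keeps it to lvs2007 and lvs2008 only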
[15:26:13] (03CR) 10jenkins-bot: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:26:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) 05Open>03stalled [15:26:22] (03CR) 10jerkins-bot: [V: 04-1] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:26:28] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) [15:26:41] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) Since this relates to an access request that was already approved at this weeks SRE... [15:27:06] (03PS2) 10Herron: admin: add Matt(onimisionipe) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/463967 (https://phabricator.wikimedia.org/T205981) (owner: 10Mathew.onipe) [15:27:44] (03PS3) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [15:27:53] (03CR) 10Cwhite: [C: 032] hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:28:03] (03CR) 10Herron: [C: 032] admin: add Matt(onimisionipe) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/463967 (https://phabricator.wikimedia.org/T205981) (owner: 10Mathew.onipe) [15:28:47] (03CR) 10Vgutierrez: "Cool! add the missing space between Bug: and T196560 in the commit message and it's ready to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/464584 (owner: 10Papaul) [15:32:16] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10herron) 05Open>03Resolved a:03herron Change has been merged and will propagate out acr... 
[15:34:10] (03PS3) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [15:34:27] (03PS4) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [15:34:54] (03PS5) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [15:35:19] (03PS2) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) [15:35:40] (03PS3) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [15:36:53] (03PS3) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [15:38:21] (03CR) 10Volans: [C: 031] "LGTM, it might even work at first try :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [15:39:18] (03CR) 10Herron: ircecho: restart service on change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [15:41:46] !log depool kafka1002 from eventbus as precautionary step for T201039 [15:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [15:43:46] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:58] (03PS3) 10Papaul: iPartman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) [15:49:18] ^ shdubsh: caused by your patch: [15:49:32] Error while evaluating a Resource Statement, Duplicate declaration: Package[diamond] is already declared in file /etc/puppet/modules/standard/manifests/diamond.pp:23; cannot redeclare at /etc/puppet/modules/diamond/manifests/init.pp:69 at /etc/puppet/modules/diamond/manifests/init.pp:69:5 at /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp:32 on node etherpad1001.eqiad.wmnet [15:49:35] (03PS1) 10Cwhite: standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) [15:49:47] papaul: I think you accidentally added an "i" at the beginning of the commit message :( [15:49:51] (03PS3) 10Mathew.onipe: icinga::monitor::elasticsearch: throttle alerts notifications [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) [15:50:04] (03CR) 10Volans: [C: 031] "LGTM, thanks for taking care of this." [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [15:50:19] moritzm: indeed. 
hoping the latest patch will alleviate [15:51:02] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444 [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:06] T199444: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 [15:52:42] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444 (duration: 01m 40s) [15:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:41] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444 [15:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:36] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444 (duration: 00m 55s) [15:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] ACKNOWLEDGEMENT - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues cole_white https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464595/ [15:55:41] jouncebot: next [15:55:42] In 0 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1600) [15:55:47] shdubsh: that would probably end up in a whack-a-mole of various occurrences of diamond::collector->absent, we probably need to look into a different fix [15:56:23] !log icinga downtime every server with the cloudXXXX scheme for 2h T201039 [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:27] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [15:58:33] !log icinga downtime every server in the main cloudvps deployment for 2h T201039 [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1600). Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:11] o/ [16:00:12] !log Stop MySQL on db1073 for mariadb and kernel upgrade - T201039 T148507 [16:00:15] arturo: ^ [16:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:21] ack [16:00:52] kernel upgrade? I thought this was just about the recabling? [16:01:09] arturo: we are taking the window also to upgrade mysql, move the socket to the correct path and upgrade the kernel [16:01:13] (and upgrade mysql) [16:01:27] cool [16:02:57] arturo: server rebooting now [16:03:00] (03PS4) 10Papaul: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) [16:03:13] it is important to have those kernels fresh, arturo :-) [16:03:24] it is when they taste better! [16:03:24] sure [16:03:46] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
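The db1073 maintenance interleaved above follows a common pattern: stop MySQL, upgrade, reboot, restart, then reload the proxy that fronts it. A rough sketch, assuming stock Debian service names; the real hosts use WMF-specific packages and runbooks that are not reproduced here:

    # on db1073
    sudo systemctl stop mariadb
    sudo apt-get update && sudo apt-get -y upgrade   # pick up new mariadb and kernel packages
    sudo reboot
    # after the reboot
    sudo systemctl start mariadb
    mysql -e "SELECT @@version, @@socket"            # confirm version and the new socket path
    # on the proxy in front of it (dbproxy1005 in this case)
    sudo systemctl reload haproxy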
[16:03:55] 10Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (10Joe) The master DC is a variable, and while in production that's dynamically generated from etcd (more or less), in the compiler is a static value. That was a deliberate choice to decouple... [16:04:07] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:04:14] wikitech is broke'd https://wikitech.wikimedia.org/ [16:04:16] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [16:04:32] (03CR) 10Vgutierrez: [C: 032] Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [16:04:37] Cannot access the database: Cannot access the database: Unknown error [16:04:45] AndyRussG: that's expected [16:04:46] (03PS5) 10Vgutierrez: Partman: Add lvs2007 and lvs2008 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/464584 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [16:04:51] AndyRussG: yes, there is maintenance going on [16:05:12] (03PS4) 10Cwhite: Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:05:33] arturo: server up, starting mysql now [16:05:50] ack [16:05:52] just connected indeed [16:06:22] marostegui: arturo ah okok thx! have fun :) [16:06:38] arturo: everything is up [16:06:41] we should be back [16:06:41] are the deploys done? (dunno how I can check it) [16:06:45] socket also in the new location! [16:06:49] marostegui: do I reload the proxy? [16:06:56] yes please [16:07:09] !log reloading haproxy @ dbproxy1005 [16:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:13] XioNoX: there is nowhere to check really except asking in here like you did [16:07:22] I can edit wikitech finely [16:07:25] mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP [16:07:26] AndyRussG: ^ [16:08:07] arturo: let me know if you see all good from your end [16:08:07] !log logged downtime for phabricator in icinga, stopped phd queue processing in preparation for read-only mode [16:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:49] marostegui: apparently yes. I see openstack talking to the DB [16:09:10] great [16:09:24] marostegui: yeah looks good now :) [16:09:30] thx! [16:09:41] good to hear, thanks! [16:11:10] thcipriani: is the deployment done? [16:11:40] I understand no physical cable was unplugged yet, right XioNoX ? [16:11:50] arturo: correct [16:11:52] deployment? I wasn't deploying. [16:11:54] (03PS4) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [16:12:04] puppet swat [16:12:16] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [16:12:17] oh! no, nobody did puppet SWAT: should be a simple one. [16:12:19] (03PS1) 10Jcrespo: Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 [16:12:28] was mentioning "thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker." [16:12:43] let's do it after the network maintenance :) [16:12:54] ok then [16:13:20] (03CR) 10Marostegui: [C: 031] "+10000!" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:23] XioNoX: sorry, I misunderstood you. "deployments" :) [16:13:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:40] (03CR) 10Jcrespo: "With https://phabricator.wikimedia.org/T148507 closed, mariadb is no longer a blocker... although I cannot be sure for other services." [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [16:13:51] no worries, I didn't know how those things work [16:13:57] !log starting asw2-b-eqiad re-cabling - T201039 [16:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:02] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:14:21] !log Enable all VC ports on FPC2 and FPC7 - T201039 [16:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:51] should I go read-only now? [16:15:03] (03CR) 10Muehlenhoff: [C: 031] "Looks fine. We don't collect comparable metrics for timesyncd as we did for ISC ntpd (as timesyncd is far more minimalistic) and the ensur" [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:16:02] !log phabricator is read-only [16:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:24] !log Stop and reboot db1072 (phabricator master) for maintenance [16:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:41] tarrow: marostegui thanks [16:16:53] sorry^ I meant twentyafterfour [16:17:00] And not me? :_( [16:17:01] :-) [16:17:06] xd [16:17:12] phab ded? D: [16:17:23] MatmaRex: see SAL [16:17:55] A Troublesome Encounter! [16:17:55] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [16:18:11] oh, sometimes it loads, readonly indeed [16:18:18] MatmaRex: we're in read-only mode but some requests will still error out [16:18:42] (03CR) 10Herron: [C: 032] "you betcha!" 
[puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [16:19:56] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:20:28] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:20:31] ^expected [16:21:45] !log reloading dbproxy1003,8 [16:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:57] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [16:22:07] it says up on both [16:22:17] as the recovery testifys also [16:22:36] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [16:24:27] (03CR) 10Aaron Schulz: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [16:25:08] !log phabricator is read-write [16:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:32] MatmaRex: note maintenance window has not finished and there may be more interruptions [16:27:26] (03CR) 10Cwhite: [C: 032] standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:27:33] (03PS2) 10Cwhite: standard: remove diamond::collector declaration from standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/464595 (https://phabricator.wikimedia.org/T183454) [16:27:37] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [16:27:47] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [16:30:15] (03PS1) 10Giuseppe Lavagetto: Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 [16:31:29] (03CR) 10Elukey: [C: 031] Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 (owner: 10Giuseppe Lavagetto) [16:33:32] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356 (10Marostegui) [16:33:35] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [16:33:46] !log started phd on phab1001 and re-enabled puppet (I had it disabled to prevent starting phd during read-only) [16:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:49] 10Operations, 10IRCecho, 10Patch-For-Review: Puppet doesn't restart ircecho when the code changes - https://phabricator.wikimedia.org/T205539 (10herron) 05Open>03Resolved a:03herron The above patch was merged (not sure why gerritbot didn't comment about that) Resolving! [16:34:16] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:35:28] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Discovery-Analysis (Current work), 10Patch-For-Review: Add Mathew.onipe(onimisionipe) to deployment group - https://phabricator.wikimedia.org/T205981 (10EBjune) > @EBjune as @Mathew.onipe manager, could you approve this request? Approved, thanks! 
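The "mariadb,db1073,0,0,...,UP" line pasted above is haproxy's CSV stats output, which is also what the check_failover Icinga check reads. Checking backend state and reloading a dbproxy by hand looks roughly like this; the admin socket path is the Debian default and may differ on these hosts:

    # column 1 = backend, 2 = server, 18 = status (UP/DOWN)
    echo "show stat" | sudo socat stdio /run/haproxy/admin.sock | cut -d, -f1,2,18
    # validate the config, then reload gracefully so existing connections survive
    sudo haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy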
[16:35:46] PROBLEM - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:36:46] etherpad? [16:37:14] probably something else in there [16:37:46] diamond not found [16:37:50] etherpad1001 puppet-agent[26407]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Package[diamond] is already declared in file /etc/puppet/modules/standard/manifests/diamond.pp:23; cannot redeclare at [16:37:50] /etc/puppet/modules/diamond/manifests/init.pp:69 at /etc/puppet/modules/diamond/manifests/init.pp:69:5 at /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp:32 on node etherpad1001.eqiad.wmnet [16:38:26] was something deployed recently? [16:38:35] shdubsh: think that’s related to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464595/ ? [16:38:41] oh, I see [16:38:45] duplicate require [16:38:51] and not require_package [16:39:19] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Adding also @aaron to get his opinion, no idea about how to trace back what... [16:39:42] !log Enable fpc5-fpc7 - T201039 [16:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:47] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:39:48] (03PS5) 10Cwhite: Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:41:02] herron, jynus: it should be recovered now [16:41:22] !log Connect/enable fpc2:0/51-fpc5:1/0 (5m DAC) - T201039 [16:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:29] alrighty, indeed looks like most recent puppet run was happy. thanks shdubsh ! [16:42:25] diamond may need a push, however [16:42:48] reset-failed or something (don't understand the context 100%) [16:43:07] ACKNOWLEDGEMENT - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. cole_white diamond is removed. looking into removing from systemd [16:45:44] !log etherpad1001 running systemctl reset-failed [16:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] RECOVERY - Check systemd state on etherpad1001 is OK: OK - running: The system is fully operational [16:47:35] (03CR) 10Cwhite: "This latest changeset looks more happy. 
https://puppet-compiler.wmflabs.org/compiler1002/12776/" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:52:00] !log Step 3) Add missing links - T201039 [16:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:53:50] (03CR) 10Cwhite: [C: 031] Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:54:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:58:10] (03CR) 10Volans: "See inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [16:58:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:59:26] PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: connect to address checker.tools.wmflabs.org and port 80: No route to host [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1700). [17:00:36] RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.012 second response time [17:00:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:01:24] (03PS1) 10Herron: admin: add isaacj to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) [17:02:28] !log tools - published updated toollabs-* Docker images [17:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:15] PROBLEM - Host analytics1061 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:16] PROBLEM - Host analytics1063 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:30] hello [17:03:35] PROBLEM - Host wtp1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:36] everything ok? 
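The etherpad1001 cleanup logged a little earlier (16:45, "running systemctl reset-failed") is the standard way to clear a leftover failed unit once its package has been removed; until then the "Check systemd state" alert stays CRITICAL because the system reports itself as degraded. A short sketch:

    systemctl is-system-running        # reports "degraded" while any unit is failed
    systemctl --failed                 # shows the leftover unit, e.g. the removed diamond service
    sudo systemctl reset-failed        # or reset-failed <unit> to clear just one
    systemctl is-system-running        # back to "running" once nothing is failed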
[17:03:45] network maintenance on row b [17:03:46] PROBLEM - Host an-master1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:46] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:48] ah of course [17:03:50] network maintenance [17:04:06] PROBLEM - Host mwmaint1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host ores1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host notebook1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] PROBLEM - Host restbase-dev1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] PROBLEM - Host ms-be1041 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] PROBLEM - Host mw1313 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:16] they may be back [17:04:26] PROBLEM - Host mw1318 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:26] PROBLEM - Host mc1025 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host scb1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host mw1290 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host thumbor1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host elastic1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host mw1286 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host db1119 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:27] PROBLEM - Host db1113 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:28] RECOVERY - Host mw1313 is UP: PING WARNING - Packet loss = 44%, RTA = 2.71 ms [17:04:28] RECOVERY - Host restbase-dev1005 is UP: PING WARNING - Packet loss = 44%, RTA = 1.16 ms [17:04:36] PROBLEM - Host elastic1038 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host kafka1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host mw1287 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] PROBLEM - Host elastic1039 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:36] RECOVERY - Host ms-be1041 is UP: PING WARNING - Packet loss = 44%, RTA = 0.26 ms [17:04:41] XioNoX: ---^ [17:04:47] he knows [17:04:48] RECOVERY - Host elastic1036 is UP: PING WARNING - Packet loss = 80%, RTA = 0.35 ms [17:04:53] okok [17:04:55] RECOVERY - Host mw1318 is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 58%, RTA = 0.58 ms [17:04:56] RECOVERY - Host scb1002 is UP: PING WARNING - Packet loss = 28%, RTA = 0.36 ms [17:04:56] RECOVERY - Host mw1287 is UP: PING WARNING - Packet loss = 28%, RTA = 0.34 ms [17:04:56] RECOVERY - Host thumbor1001 is UP: PING WARNING - Packet loss = 37%, RTA = 4.57 ms [17:04:56] RECOVERY - Host mw1286 is UP: PING WARNING - Packet loss = 28%, RTA = 0.31 ms [17:04:56] RECOVERY - Host elastic1038 is UP: PING WARNING - Packet loss = 37%, RTA = 0.31 ms [17:04:56] RECOVERY - Host elastic1039 is UP: PING WARNING - Packet loss = 37%, RTA = 0.40 ms [17:04:56] RECOVERY - Host kafka1002 is UP: PING WARNING - Packet loss = 37%, RTA = 0.36 ms [17:04:57] RECOVERY - Host mc1025 is UP: PING WARNING - Packet loss = 37%, RTA = 0.38 ms [17:04:57] RECOVERY - Host mw1290 is UP: PING WARNING - Packet loss = 28%, RTA = 0.89 ms [17:05:06] PROBLEM - HHVM rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:06] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:06] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:15] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed [17:05:15] nse was received [17:05:16] PROBLEM - SSH on elastic1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:16] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:16] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:17] PROBLEM - Nginx local proxy to jobrunner on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:25] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:25] PROBLEM - Nginx local proxy to jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:25] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:26] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:26] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:27] PROBLEM - SSH on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:35] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:35] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for 
January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:05:37] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:38] PROBLEM - Host mw1314 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:52] is this useful to anyone or shall I disable ircecho in favor of unhandld issues dashboard in icinga [17:05:55] PROBLEM - Host mw1304 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:56] PROBLEM - Host mw1302 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:56] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=tr [17:05:56] re a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was [17:05:56] /v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [17:05:57] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:05:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:06:05] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:05] PROBLEM - Host mc1027 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:05] PROBLEM - Host mw1285 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:05] PROBLEM - Host mw1301 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:06] PROBLEM - Host cp1081 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:06] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:06] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:15] PROBLEM - Host mw1296 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:15] PROBLEM - Host mw1288 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Host mw1306 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Host thumbor1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:06:16] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:17] PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:17] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:17] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:18] PROBLEM - Host cp1082 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:25] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:06:26] PROBLEM - Host wtp1033 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:27] PROBLEM - Host druid1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:35] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [17:06:37] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server 
Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) [17:06:38] are mobileapps alerts supposed to be going off? [17:06:49] and in codfw? [17:06:50] !log stop ircecho on einstenium - alarms shower [17:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:58] everything talks to everything, often in mysterious ways [17:08:21] :S [17:08:23] I'd assume the codfw mobileapps is related to the row B network stuff in eqiad, but it might be hard to track down the causal change [17:08:34] s/change/chain/ [17:09:50] so it might be related to aqs having troubles [17:10:11] (the mobile apps alarms) [17:13:09] 08Warning Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Inbound interface errors [17:13:35] yeah I think it is the chain mobileapps -> aqs -> druid [17:15:46] <_joe_> elukey: why does mobileapps read from aqs? [17:15:51] !log triggering some alerts on labvirt1018 to figure out about alert thresholds [17:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:55] <_joe_> if that's the case, aqs needs to be multi-dc [17:16:29] could be worse, there's only 3 services in that chain :) [17:16:42] <_joe_> can I ask if we know when the maintenance will be over? [17:16:48] we don't yet know [17:16:51] <_joe_> or if there is any update? [17:17:10] the low-level traffic on the maintenance is in -dcops, aside from !log entries here [17:17:26] there's still cabling work ongoing [17:17:46] _joe_ I think it is for some metrics, but I have no control on the clients of course [17:17:48] should be soon if things are stable [17:17:54] I recall that we had as similar problem a while ago [17:18:14] <_joe_> XioNoX: ok, thanks [17:18:33] the original window statement from the ticket was: [17:18:36] the new asw2-b-eqiad that will be impacted by Thursday 4th 16:00UTC 2h maintenance window (with a worse case of a 30min downtime for those hosts, and a best case of no impact). [17:19:29] so from that pov, we've got ~41 minutes left on the maint window, and so far affected hosts have been impacted for 16 of the 30 mins [17:19:42] (03PS2) 10Giuseppe Lavagetto: Fix TLS connections to etcdv3 on stretch [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464602 [17:19:44] (03PS1) 10Giuseppe Lavagetto: New upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464608 [17:20:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] New upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/464608 (owner: 10Giuseppe Lavagetto) [17:21:13] troubleshooting one last link not coming up [17:21:51] we still have FPC6 in some disconnected state, that might not be related to the bad 2-8 link right? [17:22:00] bblack: do we? [17:22:12] !log re-enable ircecho after alarms shower [17:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:15] I think FPC6 looks healthy [17:22:23] I assume since there were host down alerts above that never recovered yet [17:22:42] oh, icinga-wm floodquit or whatever, so irc log isn't reliable [17:23:19] yeah, the FPC disconnect were brief [17:23:37] bblack: I stopped ircecho to avoid the shower of alarms and be able to talk in here [17:24:10] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Inbound interface errors [17:24:55] <_joe_> elukey: bad idea tbh [17:24:56] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:25:08] <_joe_> it made me think the mc servers were down for half an hour [17:25:50] _joe_ I will not do it again, but to me it is pointless to keep seeing alerts in here and not have a place to talk [17:25:56] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 35, unassigned_shards: 1268, number_of_pending_tasks: 57, number_of_in_flight_fetch: 35, timed_out: False, active_primary_shards: 3178, task_max_waiting_in_queue_millis: 1308459, cluster_name: production-search-eqiad, relocating_shards: 3, active_shards_percent_as_ [17:25:56] we have the icinga ui to check [17:25:56] 58, active_shards: 8241, initializing_shards: 39, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [17:26:07] <_joe_> I get it, but I didn't notice it [17:26:14] sure sure [17:26:16] <_joe_> since we usually don't do it [17:26:39] I had a different worflow/impression (already did it other times), will not do it again :) [17:27:24] <_joe_> you can do it, but maybe do it *before* the shower of alerts happen [17:27:30] well the not-really-followed plan was to use this channel more for logs + spam, and move true conversation elsewhere, but it hasn't really materialized in practice [17:27:37] <_joe_> or people reading the backlog will have a heart attack :P [17:28:06] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:28:12] is this a bad time for me to be deploying parsoid? [17:28:25] RECOVERY - AQS root url on aqs1008 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.018 second response time [17:28:34] <_joe_> arlolra: I fear it is, there is a network mainentenance ongoing in eqiad [17:28:46] ok, will postpone [17:28:47] thanks [17:28:49] <_joe_> your deploy might fail in unpredictable, not-so-funny ways [17:28:53] I think we have light at the end of the tunnel, but not ready to declare it all-ok just yet [17:29:11] we *think* the network has been stable now for a while, and we think there's no more physical changes to make right now [17:29:18] arlolra, ok .. i guess week after switchover. [17:29:24] <_joe_> arlolra: if you can wait some minutes, maybe bblack & co might give you a green light [17:29:50] it's not urgent, better to wait [17:29:52] is anything outstanding in term of row B alerts? [17:30:50] some puppet errors that don't let see any real issue [17:30:54] nothing horrible that I can see in icinga [17:30:56] RECOVERY - puppet last run on db1113 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:56] PROBLEM - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:31:01] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad Banyek ACK https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:32:04] lvs1020: "Servers phab1001-vcs.eqiad.wmnet are marked down but pooled" does that need any action? 
[17:32:25] <_joe_> lvs1002 [17:32:35] that one [17:32:37] I was going to say, woah that's way more lvses than I remember :) [17:32:41] <_joe_> abd tes [17:32:50] <_joe_> *and yes, that's a real alert [17:32:56] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:33:08] <_joe_> ipv6 to port 22 there seems unreachable from lvs1002 [17:33:12] <_joe_> but not 1005 [17:33:45] <_joe_> bblack: should we talk in -dcops? [17:34:33] !log pool kafka1002 (eventbus) after maintenance [17:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:25] (03PS39) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [17:35:27] (03PS10) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [17:35:56] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:56] RECOVERY - puppet last run on mc1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:56] RECOVERY - puppet last run on kubestage1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:56] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:57] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:57] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:58] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:36:05] (03CR) 10Volans: "See inline, also there are a bunch of comments to a previous PS that are still valid and un-answered." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:37:04] (03CR) 10Smalyshev: [C: 031] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:37:49] alright, the network work is done [17:37:58] let me know if there is any outstanding issue [17:38:17] but the stack has been stable for a bit of time now [17:38:17] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:38:44] !log asw2-b-eqiad recabling done - T201039 [17:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:48] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [17:38:56] XioNoX: did you see _joe_'s comment? 
[17:39:06] not sure if that is network or traffic issue [17:39:07] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 [17:39:27] yeah I'm not sure either [17:39:50] but it seems odd it would be a network error and affect phab1001-vcs and not phab1001 [17:40:22] what's the source/dest of the issue? lvs1002 to ? [17:40:23] hope you can handle it, I was going to disconnect [17:40:38] yes [17:40:38] twentyafterfour may be around althoug that doesn't look service related [17:40:46] XioNoX: lvs1002 <-> phab1001-vcs [17:40:56] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:56] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:40:57] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:41:08] <_joe_> !log uploaded new python-etcd packages for jessie, stretch [17:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:25] phab1001 is row B [17:41:39] but the same machine + same interface is both the phab1001.eqiad.wmnet and phab1001-vcs.eqiad.wmnet IPs [17:42:05] and I can ping both of those IPs from lvs1002 [17:43:17] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1064:9100 job=node site=eqiad Banyek T206245 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [17:45:27] (03PS14) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:45:57] (03CR) 10Aezell: [C: 031] Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [17:46:11] oh right, I was being confused by the ipv6 part [17:46:13] (03CR) 10Aezell: [C: 031] Remove old ArticleCreationWorkflows config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [17:46:19] can't ping phab1001 or phab1001-vcs ipv6 from lvs1002 [17:46:23] still looking at it [17:46:31] IPs are properly configured on both sides [17:46:36] (03CR) 10Gehel: [C: 032] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:47:02] !log repooling labsb1010 (T195747) [17:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:08] yes, and reachable from lvs1005 [17:47:09] T195747: Create views for the schema change for refactored actor storage - https://phabricator.wikimedia.org/T195747 [17:48:29] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:48:53] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:49:07] that looks similar to the VC fabric "miss-programming" [17:49:08] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1010 to add initial actor table changes to views" [puppet] - 
10https://gerrit.wikimedia.org/r/464610 (owner: 10Bstorm) [17:49:16] that started https://phabricator.wikimedia.org/T201039 [17:49:18] so on the lvs1002 side of this [17:49:57] the ipv6 for private1-b is 2620:0:861:102:1a03:73ff:fef0:8ede on eth1.1018@eth1 with mac 18:03:73:f0:8e:de [17:50:15] phab1001 seems like it has ok ipv6 to other places, the error may be with the lvs1002 row b interface [17:50:24] hard to say though [17:50:44] PROBLEM - Etcd replication lag on conf1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.002 second response time [17:50:52] phab1001 neighbor table has a stale entry: [17:50:57] fe80::1a03:73ff:fef0:8ede dev eth0 lladdr 18:03:73:f0:8e:de STALE [17:51:02] ohrilly [17:51:05] for the fe80 [17:51:11] (03PS1) 10Gehel: Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 [17:51:28] oh and this: [17:51:30] 2620:0:861:102:1a03:73ff:fef0:8ede dev eth0 FAILED [17:51:35] lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:100 eth1.1018 [17:51:35] Soliciting 2620:0:861:102:10:64:16:100 (2620:0:861:102:10:64:16:100) on eth1.1018... [17:51:35] Timed out. [17:51:37] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:45] elukey: do we need to worry about that alert or still no? [17:52:29] indeed, the etcd replication paged, known already? [17:52:31] (03CR) 10Gehel: [C: 032] Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 (owner: 10Gehel) [17:52:35] (03PS40) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [17:52:37] (03PS11) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [17:52:50] lvs1002 has like 8 different FAILED neighor entries [17:52:50] bblack: the fastest fix might be to bounce the network port of one of the servers [17:52:56] it paged earlier and the ack was extended [17:53:02] let's debug a little first since it isn't super critical [17:53:10] maybe there's a software level solution to this, too [17:53:16] (03PS2) 10Gehel: Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 [17:53:20] the server was being installed [17:53:25] maybe it is not yet installed? [17:53:41] (03CR) 10Gehel: [V: 032 C: 032] Revert "wdqs: auto deployment of wdqs on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/464613 (owner: 10Gehel) [17:53:46] I'm guessing we don't have an easy way to target all hardware in row B via cumin right? [17:54:25] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:54:27] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.000 second response time [17:54:28] we should, lldp neighbors are a puppet fact IIRC [17:54:35] bblack: almost [17:54:37] give me a sec [17:55:27] bblack: do you have a command I should run? 
[17:55:58] volans: I'm slogging through now, just targetting everything (which is probably better anyways) [17:56:13] RECOVERY - Etcd replication lag on conf1005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.002 second response time [17:56:38] I cheated btw, I used netbox to select all rowB, export the result (is a csv) selected the column with hostname [17:56:43] and was ready to use that as a target list [17:56:44] I'm not even sure what FAILED means in the ipv6 neighbor table, but I wonder if it's a state that can happen due to a flap, but then gets stuck on the Linux side and just needs clearing [17:56:56] `phab1001:~$ sudo tcpdump -p "host 2620:0:861:102:10:64:16:100 or host fe80::1a03:73ff:fef0:8ede"` [17:56:56] `lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:100 eth1.1018` [17:56:56] bblack ^ phab1001 doesn't see the ND request [17:57:05] <_joe_> ok I am inclined to disable notifications from conf1005 [17:57:15] ah there you are [17:57:32] I was just about to speculate wildly about that host [17:57:34] _joe_: +1 for me if you'll re-enable them later [17:58:28] PROBLEM - puppet last run on labvirt1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:59:04] bblack: in case you need it, this is the search on Netbox (servers in rowB eqiad): [17:59:07] https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_group_id=6&role=server [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1800). [18:00:05] stephanebisson, jynus, and cscott: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:54] PROBLEM - High CPU load on API appserver on mw2204 is CRITICAL: CRITICAL - load average: 89.84, 36.05, 20.00 [18:00:57] I'm still staring/debugging on the phab1001-vcs issue [18:01:03] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:01:04] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hive_schematool_initialize_schema] [18:01:54] RECOVERY - High CPU load on API appserver on mw2204 is OK: OK - load average: 33.09, 29.82, 18.93 [18:01:58] Hi [18:03:35] RECOVERY - puppet last run on labvirt1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:05:56] XioNoX: it half-recovered now.... [18:06:37] state INCOMPLETE now? [18:06:41] well [18:06:56] I can do :8 now, which is the phab1001 ipv65 [18:06:58] err [18:06:59] I can do :8 now, which is the phab1001 ipv6 [18:07:11] but can't ping :100 which is the phab1001-vcs ipv6 [18:07:16] it's all the same interfaces [18:07:34] before, ipv4 was working but ipv6 wasn't. Now ipv4 and 1/2 ipv6 addrs work, but other ipv6 fails [18:07:38] <_joe_> !log disabled notifications for etcd replication lag on conf1005, not in production [18:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:50] is the juniper buggy port thing related to cam tables or whatever juniper calls them? 
[18:09:01] (switching macaddr tables) [18:09:13] bblack: 2620:0:861:102:10:64:16:100 dev eth1.1018 lladdr 14:18:77:5b:23:4c DELAY [18:09:30] that's on lvs1002? [18:09:42] bblack: so pinging lvs from phab1001-vcs IP, forces lvs to update its cache [18:09:46] neighbor table [18:09:51] I did that though [18:10:16] root@phab1001:~# ping6 -n -I 2620:0:861:102:10:64:16:100 2620:0:861:102:1a03:73ff:fef0:8ede [18:10:25] but if lvs can't send the ND broadcast, then it will expire afer some time [18:10:26] ^ was my ping from the phab1001-vcs IP to lvs1002 [18:10:31] yeah that's why I did too [18:10:32] I ran that many times [18:10:34] what* [18:10:35] ah [18:10:44] bblack, XioNoX: Is the thing you are working on blocking SWAT? [18:10:49] stephanebisson: no [18:11:26] XioNoX: I'm inclined to think this particular issue (and maybe others like it?) are more of a bad linux software reaction to a network blip than anything else [18:11:28] bblack: and now on lvs1002 "2620:0:861:102:10:64:16:100 dev eth1.1018 lladdr 14:18:77:5b:23:4c REACHABLE" [18:11:34] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [18:11:38] fair enough [18:11:39] (bad reaction in the sense of bad neighbor table states, etc) [18:11:53] I guess I'll do the SWAT [18:12:02] in any case, it did just eventually get fixed with no switch-side work, so that says a lot [18:12:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:12:29] bouncing the switch port might've fixed it too, but probably indirectly by flushing out everything related on the host's software interface stuff [18:13:03] bblack: I'm still confused on why ND doesn't work for that host [18:13:16] yeah that's what I'm saying [18:13:41] (03PS2) 10Sbisson: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) [18:14:07] I don't know why ND didn't work, but (perhaps a race condition / bug somewhere) went wrong when parts of asw2-b were flapping, and one or both sides of this pair of host got into a confused state on their ipv6 reachability to each other. 
[18:14:22] but v4 was working the whole time [18:14:28] lvs1002:~$ ndisc6 2620:0:861:102:10:64:16:8 eth1.1018 <- timeout [18:14:28] lvs1002:~$ ndisc6 2620:0:861:4:208:80:155:108 eth3.1004 <- works [18:14:29] so it's not like the interface wasn't passing eth packets [18:14:32] ND still doesn't work [18:14:36] afaik [18:14:39] for that host [18:15:27] either the switch is interfering specifically at that level (with v6 discovery traffic / port-switching stuff), or it's a host-side problem [18:15:41] but ethernet traffic does flow between these macaddrs, for ipv4 [18:16:03] PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:16:19] yeah, it's very specific to ND/multicast to at least some destination [18:16:23] PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:16:45] (03PS1) 10Bstorm: wiki replicas: depool labsdb1011 to add initial actor table changes to views [puppet] - 10https://gerrit.wikimedia.org/r/464619 (https://phabricator.wikimedia.org/T195747) [18:16:50] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:18:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:18:19] (03Merged) 10jenkins-bot: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:18:59] yeah [18:19:11] actually almost all ipv6 ND from lvs1002 -> row B seems bad [18:19:21] public vlan too [18:20:44] XioNoX: how about I stop pybal on lvs1002 (failover to 1005), and then we can try bouncing lvs1002:eth1? [18:20:51] bblack: sounds good [18:20:54] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled [18:21:12] !log lvs1002: puppet disabled, stopping pybal (fail to 1005) [18:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:50] going to try software first [18:21:54] ok [18:22:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:23:25] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:464510|Enable PageTriage/ORES on enwiki (T206149)]] (duration: 01m 01s) [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:30] T206149: Enable ORES in PageTriage in production - https://phabricator.wikimedia.org/T206149 [18:23:32] yeah no fix [18:23:37] XioNoX: try switch? 
[18:23:44] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:23:50] ok [18:24:03] PROBLEM - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [18:24:27] !log bounce lvs1002:eth1 switch port [18:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:28] (03PS41) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [18:24:30] (03PS12) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [18:24:34] cscott: Are you around? [18:25:17] bblack: back on [18:25:33] yeah no help [18:25:34] PROBLEM - PyBal connections to etcd on lvs1002 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=10) [18:25:53] (03CR) 10jenkins-bot: Enable PageTriage/ORES on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464510 (https://phabricator.wikimedia.org/T206149) (owner: 10Sbisson) [18:26:14] weird [18:27:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [18:27:15] XioNoX: we could try 1001 if we keep the flap quick? [18:27:22] (err, phab1001) [18:27:36] I donno [18:27:56] I guess lots of integreations there for the deployment that's ongoing too though [18:28:01] cscott: I can deploy your patch if you become available in the next 15 minutes. Just let me know. [18:28:10] (03Abandoned) 10Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - 10https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [18:28:17] bblack: we can probably leave it as it for now [18:28:17] stephanebisson: i'm around! [18:28:28] stephanebisson: sorry, lost track of swat time [18:28:32] cscott: ok, let's do it [18:28:43] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:28:51] XioNoX: yeah I'm not sure as to the impact really [18:28:53] (03CR) 10Sbisson: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:28:55] minor to be sure [18:28:59] (03PS6) 10Sbisson: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:29:07] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:30:21] it's going to impact anyone connection to git-ssh.wikimedia.org with v6, but maybe most would fall back to v4 (or even start there) [18:31:06] (03Merged) 10jenkins-bot: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:31:48] bblack: but it's currently working, right? [18:31:54] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zuul] [18:32:02] XioNoX: I doubt it! [18:32:06] but maybe? 
[18:32:44] oh, it is [18:32:52] cscott: You change is on mwdebug2001. Can you test? [18:32:58] your* [18:33:00] I guess, even with ND borked, it's still routing traffic based on the arp of the ipv4 [18:33:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:33:29] ok. can you remind me how to pin my request to a particular mwdebug server? [18:33:33] although I'm not sure I understand how [18:33:36] ipvsadm says: [18:33:38] TCP [2620:0:861:ed1a::3:16]:22 wrr -> [2620:0:861:102:10:64:16:100]:22 Route 10 0 0 [18:34:42] cscott: There's a browser extension called "X-Wikimedia-debug". When you click on it you can select the server (mwdebug2001) and turn it ON. [18:35:42] XioNoX: just file a task, note that public git-ssh over v4 + v6 still appear to be working, ack the alert for now. And we can try bouncing phab1001 port later when there's no deploy windows or anything else going on. [18:36:09] sounds good, what's the alert? [18:37:05] ah, found it [18:37:07] "Servers phab1001-vcs.eqiad.wmnet are marked down but pooled" [18:37:08] the 3x criticals showing on lvs1002 [18:37:26] err wait, those are mine [18:37:41] that's probably why this is working heh [18:37:42] yeah, from pybal [18:37:45] :) [18:38:05] will leave things disabled/stopped on 1002 for now then as part of it [18:38:43] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zuul] [18:40:27] stephanebisson: ok, confirmed that mwdebug2001 is still using Preprocessor_Hash in production (as it should be) [18:40:41] we don't have any machine running php7 yet in prod, do we? [18:41:37] (03CR) 10jenkins-bot: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:43:30] (03CR) 10Volans: [C: 031] "LGTM, let's just wait next week after the switchover to merge as we'll decom the remaining jessie hosts where this should go that would no" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [18:44:53] (03CR) 10C. Scott Ananian: "Deployed and tested on mwdebug2001:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 (owner: 10C. Scott Ananian) [18:45:23] (03PS17) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) [18:46:10] (03PS8) 10Paladox: Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) [18:46:42] cscott: re:php7, I don't know :( [18:46:52] cscott: No. [18:46:58] pretty sure we don't. [18:47:25] cscott: Well, yes, but not for user-facing duties. The dump runners are running PHP7 and so is at least one cron-based task (l10nupdate) [18:47:26] so i can't test that we use Preprocessor_DOM on PHP7 machines because there aren't any yet. so i think that's all good. [18:47:48] I have machines running php7 [18:47:53] but that doesn't help you any [18:47:58] they are not set up as app servers [18:48:08] cscott: going live... [18:48:15] they have mediawiki on them but they are closer to being like maintenance servrs [18:48:32] Maybe we should add in a php7-flavoured debug server (mwdebug2003php7omg or whatever). 
[18:48:43] well, the curl command you'd use to see whether you're running preprocessor_DOM or preprocessor_hash is in the last comment on https://gerrit.wikimedia.org/r/460202 if you wanted to play around on your own [18:49:01] it should get 0 traffic right now if we do that... I think testing in beta is the better way to go for now [18:49:03] James_F: yeah, that's really what i was wondering if we had [18:49:20] Not yet. [18:49:35] bblack: I updated https://phabricator.wikimedia.org/T201039#4643336 we can keep the same task to T/S the lvs/phab issue, would later today be a good time to do that switch port bounce? [18:49:48] stephanebisson: let me know when that's done and I'll perform one more test w/o x-mediawiki-debug [18:49:57] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460202|]] (duration: 00m 59s) [18:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:31] cscott: all done [18:51:02] still getting = in the output aka Preprocessor_Hash as we expect. all good. [18:55:44] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [18:55:46] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206254 [18:56:02] XioNoX: not sure. checking cal [18:56:43] XioNoX: https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_October_04 [18:57:02] so basically, the next opening is ~2h window from 21-23 UTC [18:57:18] ok! [18:57:21] assuming the stuff before finishes on time, we can declare a short "ok phab things might bounce" and bounce the port [18:59:01] (03CR) 10BBlack: [C: 031] "LGTM?" [puppet] - 10https://gerrit.wikimedia.org/r/464563 (https://phabricator.wikimedia.org/T170606) (owner: 10Ottomata) [19:00:04] marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T1900). 
[19:03:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:08:03] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:14:12] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@6dc89c0]: Bump cirrusSearchLinksUpdate concurrency to 50 [19:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:05] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@6dc89c0]: Bump cirrusSearchLinksUpdate concurrency to 50 (duration: 00m 53s) [19:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:05] train is rolling [19:22:35] (03PS1) 10Dduvall: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 [19:22:37] (03CR) 10Dduvall: [C: 032] all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:24:12] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:24:57] (03CR) 10Ayounsi: [C: 032] Add fake ssh keys for netbox user [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:25:00] (03CR) 10Ayounsi: [V: 032 C: 032] Add fake ssh keys for netbox user [labs/private] - 10https://gerrit.wikimedia.org/r/464081 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:25:10] (03CR) 10Krinkle: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:25:17] (03CR) 10Krinkle: [C: 04-1] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:26:16] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.24 [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:22] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464631 (owner: 10Dduvall) [19:30:00] seeing quite a rise in fatals [19:30:37] !log rise in fatals "Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.24/includes/Title.php" [19:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:12] possibly something temporary like bytecode cache warmup [19:33:10] (03CR) 10Dzahn: "in the past i have ran into puppet issues when attempting to create a new user and add it to groups in the same patch. 
i think it's a race" [puppet] - 10https://gerrit.wikimedia.org/r/464605 (https://phabricator.wikimedia.org/T205840) (owner: 10Herron) [19:33:22] marxarelli: yeah, those are known, sadly [19:33:36] no, this is too long i think [19:33:39] it hasn't subsided [19:33:51] rolling back [19:33:52] hmmm [19:34:25] (03PS1) 10Gehel: wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) [19:34:49] marxarelli: https://phabricator.wikimedia.org/T204871 [19:34:51] entire fatalmonitor is filled with those "web request took longer than 60 seconds and timed out" [19:35:27] (03CR) 10Ayounsi: [C: 032] "SSH key added to the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:35:49] i see [19:35:51] (03PS3) 10Ayounsi: Netbox, set the napalm_username variable and matching keyholder [puppet] - 10https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) [19:36:07] so the 60 request timeout is new? [19:36:26] or newly fixed? [19:36:38] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10herron) While the optimal approach would be to dedicate new hardware to Kafka in both eqiad and codfw, after a few conversations within the infra team it sounds like a re... [19:36:46] marxarelli: newly fixed [19:36:51] i'll wait on it then [19:37:12] seems to have gone back down [19:37:23] (03CR) 10Mathew.onipe: [C: 031] wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) (owner: 10Gehel) [19:37:23] sort of [19:37:35] (03CR) 10Gehel: [C: 032] wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) (owner: 10Gehel) [19:37:46] (03PS2) 10Gehel: wdqs: wdqs-roots group should exist on all wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/464635 (https://phabricator.wikimedia.org/T205543) [19:39:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:41:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:44:28] 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) [19:47:51] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) a:03Cmjohnson This is m5 master, let's get the disk replaced soon Thanks! [19:48:42] Request from 88.97.96.89 via cp2002 cp2002, Varnish XID 131239612 [19:48:42] Error: 429, Too Many Requests at Thu, 04 Oct 2018 19:48:24 GMT [19:48:51] https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf/page156-716px-The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf.jpg [19:49:00] Busy server? 
[19:49:07] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) p:05Triage>03High [19:49:31] tooo many requests [19:49:52] The main issue seems to be images not loading at Wikisource [19:50:12] which is a tiresome as it uses image scans for proofreading [19:50:13] ShakespeareFan00, how big is this problem? can you load anything from upload.wm.o at all? [19:50:50] https://upload.wikimedia.org/wikipedia/commons/8/87/Donna_Strickland%2C_OSA_Holiday_Party_2012.jpg [19:50:53] Loaded [19:50:59] i can load images from Wikisource but on the link ShakespeareFan00 gave i can reproduce. [19:51:31] (03CR) 10Dzahn: [C: 031] "lgtm, compiler run looks positive, also that it is already "beta-picked" speaks for it. (if that means the latest PS is picked)" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:51:39] so that one particular image [19:51:47] Or that DJVU [19:51:50] any other URLs ShakespeareFan00? [19:52:20] https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf/page14-716px-The_Migration_of_Birds_-_Thomas_A_Coward_-_1912.pdf.jpg [19:52:34] Seem to that PDF specfically [19:52:53] well [19:52:55] open a task [19:53:48] stick #thumbor and #media-storage and #traffic on there [19:53:52] sigh [19:56:14] !log mforns@deploy1001 Started deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 [19:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:00] (03CR) 10Gehel: [C: 04-1] "See comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464570 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [20:02:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:03:44] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:06:16] (03PS1) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [20:10:18] !log mforns@deploy1001 Finished deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 (duration: 14m 04s) [20:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:10:24] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [20:10:33] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:20:04] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.003 second response time [20:21:14] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.060 second response time [20:21:51] (03CR) 10Cwhite: [C: 031] icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:22:59] !log mforns@deploy1001 Started deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 [20:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:10] (03CR) 10Cwhite: [C: 031] icinga::web: don't include ::icinga [puppet] - 10https://gerrit.wikimedia.org/r/463405 (owner: 10Dzahn) [20:23:16] !log mforns@deploy1001 Finished deploy [analytics/refinery@3eb9bf2]: deploying refinery together with refinery-source v0.0.76 (duration: 00m 17s) [20:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:19] (03CR) 10Gehel: [C: 04-1] "This is actually more complex than it looks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:31:33] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) https://netbox.wikimedia.org/api/dcim/devices/1945/napalm/ > NAPALM is not installed. Please see the documentation for instructions. While ```lang=bash netmon1002:/srv/deployment/netbo... [20:32:26] (03CR) 10Dzahn: [C: 04-1] "will now wait until after eqiad switch back and mwmaint1002 is confirmed working" [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:42:04] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [20:43:04] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [20:45:03] (03CR) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:49:37] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) I sent an email to Ellie to follow-up. 
[20:49:49] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) a:03Dzahn [20:53:21] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) a:05herron>03Dzahn [20:56:03] 10Operations, 10Cloud-Services, 10Mail, 10User-herron: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) p:05Triage>03Normal [21:05:44] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 67, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:06:02] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) Alright. Done! I added this as an example on wikitech: https://wikitech.wikimedia.org/wiki/Mailman#Real_world_exa... [21:08:25] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) 05Open>03Resolved added this to list description: //This list has been disabled in favor of wikitech-l in http... [21:14:23] (03CR) 10Gergő Tisza: [C: 031] Gerrit: Setup avatars url in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [21:14:44] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10Dzahn) @Psychoslave Hi, is this ticket resolved or do you have further questions on the new list? Looks to me you are admin on https://lists.wikimedia.org/mail... [21:15:09] (03CR) 10Gergő Tisza: [C: 031] Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [21:16:24] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 108 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:19:50] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) I see.. hmm. yea, then we should buy a replacement disk. [21:20:03] XioNoX: ready to try the phab1001 port? 
[21:20:20] bblack: almost done with the DR call [21:20:36] bblack: link came up 10min into the meeting [21:21:25] (03PS7) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [21:21:37] ok [21:22:31] (03PS8) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [21:22:33] bblack: tldr for the equinix link, they connected it to the wrong port [21:23:04] still have to figure out who owns the part of the circuit that goes into UL space, [21:23:34] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 [21:23:36] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 [21:23:41] bblack: alright, let's look at phab1001 [21:24:58] (03CR) 10Cwhite: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:25:01] Krenair: ShakespearFan00: re: 429s on thumbs, there's oustanding tickets, it's a relatively-known issue: T175512 T151202 T203135 [21:25:02] T175512: thumbor 429 throttled error messages are confusing - https://phabricator.wikimedia.org/T175512 [21:25:03] T203135: ThumbnailRender job fails with 429 errors - https://phabricator.wikimedia.org/T203135 [21:25:03] T151202: 429 Error generating thumbnail - https://phabricator.wikimedia.org/T151202 [21:25:15] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) For what it's worth, the aliases in eqiad1 can be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464722/ or something like it. [21:26:05] all: there will be a very brief blip of traffic to all phabricator services! [21:26:36] XioNoX: log the port action here I guess and give it a swing! [21:26:48] bblack, ok, thanks [21:27:46] !log bounce phab1001 switch port - T201039 [21:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:51] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [21:28:33] done [21:29:00] [Thu Oct 4 21:28:27 2018] tg3 0000:02:00.0 eth0: Link is down [21:29:00] [Thu Oct 4 21:28:53 2018] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex [21:29:11] hum, still no ND [21:29:13] still no ping6 to lvs1002 ephemeral v6 [21:29:17] from lvs [21:30:48] works the other way around though [21:30:59] (ndisc6 for lvs1002's ipv6 works from phab1001) [21:31:05] yeah [21:32:09] so this means basically, probably, that multicast/broadcast v6 stuff isn't making it into phab1001's port, or isn't making it into lvs1002's port, I think? [21:32:32] I wonder if we can try some other fake multicast traffic to see if either of those are true, and sniff [21:32:57] sorry the earlier line should've said: isn't making it into phab1001's port, or isn't making it out of lvs1002's ports [21:33:19] it's the multicast that 1002 sends to solicit phab1001 that gets no answer [21:33:34] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10eyoung) As Daniel offered above, I need the passwords . Can you reset it for me and I'll deal with it then, getting Irene on as new administrator as well. 
Thanks All! https:... [21:33:35] yeah, never makes it to phab1001 [21:34:04] I wonder if the same flow is blocked for some other kind of multicast v6 [21:34:13] (and to/from other hosts than that pair) [21:34:19] it's a very odd problem :P [21:35:31] we could install ndisc6 on all hosts and script some cumin, but that looks overkill [21:36:04] I have some notes/commands from a previous meeting with JTAC, will try to re-understand what they mean [21:36:31] basically to track a route to a specific mac inside the fabric [21:37:13] I ran through a global "ip -6 neigh show|grep FAIL" earlier FWIW [21:37:35] 10% of all hosts have at least some FAILED entries, and most aren't related to asw2-b [21:38:00] so I don't think it's necessarily uncommon, and it could be we have some general edginess in our ipv6 setup that whatever this is, is just making worse [21:45:07] 10Operations, 10Wikimedia-Mailing-lists, 10User-Urbanecm: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10Dzahn) 05Open>03Resolved a:03Dzahn I logged in on the admin interface using the master password from pwstore. From there i followed the link to the archive... [21:54:56] (03CR) 10Alex Monk: [C: 031] labs-ip-alias-dump.py: remove an unused variable [puppet] - 10https://gerrit.wikimedia.org/r/464721 (owner: 10Andrew Bogott) [21:55:24] (03CR) 10Alex Monk: [C: 031] labs-ip-alias-dump.py: Fix enumerating IPs in Neutron [puppet] - 10https://gerrit.wikimedia.org/r/464722 (owner: 10Andrew Bogott) [21:55:45] bblack: I ran the same troubleshooting commands that I had from JTAC, but don't see the same symptoms there [21:56:05] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Dzahn) The request was to also remove archives. This changed things and had to use "rmlist -a" instead. ``` root@fermium... [21:56:20] next step is to open a jtac ticket I guess [22:11:05] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) ``` [fermium:~] $ list_name="wikimania-com" ; sudo /var/lib/mailman/bin/change_pw -l $list_name -p $(pwgen -c1 -s 12) New wikimania-com password: [fermium:~]... [22:18:49] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) 05Open>03Resolved Tentatively calling it resolved. If anything is not working as expected please let us know and we will reopen this right away. [22:43:53] RECOVERY - ElasticSearch shard size check on search.svc.codfw.wmnet is OK: OK - All good! [22:52:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:53:58] 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Dzahn) I'm afraid we have exhausted our options here. It's a thing the list admins need to agree and change if they want to. They have been pinged... 
[22:56:22] (03PS1) 10Nuria: Rotate logs in refinery based on time rather than size [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) [22:56:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:56:53] 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Dzahn) The right venue for this issue is still emailing the list owners at: **wikimedia-l-owner@lists.wikimedia.org** Or if that fails, wikimedia-l... [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181004T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:03:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen