[00:00:36] (CR) jenkins-bot: Configure Babel for elwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: DatGuy)
[00:01:56] (CR) jenkins-bot: Test LoginNotify on Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878) (owner: Niharika29)
[00:02:49] (CR) jenkins-bot: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) (owner: DatGuy)
[00:03:41] Niharika: so once https://integration.wikimedia.org/ci/job/beta-scap-eqiad/149291/console completes your LoginNotify patch should be live on beta
[00:04:01] thcipriani: Cool!
[00:16:57] thcipriani: Hmm, I don't see the extension on https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Version Did I mess up something in the patch?
[00:17:16] Nor on https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version
[00:17:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:18:58] (CR) Dzahn: [C: 2] Remove Apache across the tree [puppet] - https://gerrit.wikimedia.org/r/346128 (owner: Faidon Liambotis)
[00:19:16] Niharika: hrm, doesn't look like your patch is there just yet...lemme see if I can manually deploy it
[00:19:26] thcipriani: Okay.
[00:22:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:24:28] (CR) Dzahn: [C: 2] aptrepo: remove precise-wikimedia [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[00:25:13] Niharika: something weird with the git fetch/rebase on deployment-tin, once this is done it will be there: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/149293/console
[00:25:17] sorry about that :(
[00:25:32] thcipriani: No problem. Thanks for fixing it!
[00:27:50] (PS3) Dzahn: aptrepo: remove precise-wikimedia [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[00:30:25] (PS4) Dzahn: aptrepo: remove precise-wikimedia [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[00:39:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:44:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:45:49] !log install1002/2002: sudo -i reprepro --delete clearvanished to remove precise distro after merging gerrit:345550
[00:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:00] (CR) Dzahn: "[install1002:/srv/wikimedia] $ sudo -i reprepro list apache" [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[00:50:40] (CR) Dzahn: "install2002: sudo reprepro --delete clearvanished" [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[00:56:40] (PS3) Dzahn: Add Bytemark to public_mirrors.html list [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[00:58:03] (CR) Dzahn: [C: 2] Add Bytemark to
public_mirrors.html list [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[00:58:48] (CR) Dzahn: [C: 2] hhvm: kill a precise reference [puppet] - https://gerrit.wikimedia.org/r/345547 (owner: Faidon Liambotis)
[01:00:12] (CR) Dzahn: "add something to "Location" column?" [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[01:01:19] (PS4) Dzahn: hhvm: kill a precise reference [puppet] - https://gerrit.wikimedia.org/r/345547 (owner: Faidon Liambotis)
[01:01:54] (PS5) Dzahn: hhvm: kill a precise reference [puppet] - https://gerrit.wikimedia.org/r/345547 (owner: Faidon Liambotis)
[01:05:28] (CR) Reedy: "How did I miss that? :/" [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[01:08:07] (PS1) Dzahn: dumps: add location to Bytemark (UK) mirror [puppet] - https://gerrit.wikimedia.org/r/346226
[01:08:36] (PS1) Reedy: Add location of Bytemark mirror [puppet] - https://gerrit.wikimedia.org/r/346227 (https://phabricator.wikimedia.org/T159331)
[01:08:38] snap
[01:09:18] hehe, you were probably also waiting for gerrit to take it
[01:10:00] (CR) Dzahn: [C: 2] dumps: add location to Bytemark (UK) mirror [puppet] - https://gerrit.wikimedia.org/r/346226 (owner: Dzahn)
[01:10:09] (PS2) Dzahn: dumps: add location to Bytemark (UK) mirror [puppet] - https://gerrit.wikimedia.org/r/346226
[01:10:15] (CR) Dzahn: [V: 2 C: 2] dumps: add location to Bytemark (UK) mirror [puppet] - https://gerrit.wikimedia.org/r/346226 (owner: Dzahn)
[01:10:26] (Abandoned) Dzahn: Add location of Bytemark mirror [puppet] - https://gerrit.wikimedia.org/r/346227 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[01:14:57] (PS2) Dzahn: releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon
Liambotis)
[01:15:57] (CR) Dzahn: "are we waiting until actual EOL?" [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[01:16:01] (CR) jerkins-bot: [V: -1] releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[01:18:46] (PS3) Dzahn: releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[01:18:48] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:19:08] (CR) Dzahn: "PS3: fixed lint warning that made jenkins-bot -1" [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[01:19:28] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:26:44] (PS1) Reedy: Disable LoginNotify on wikis that don't have Echo [mediawiki-config] - https://gerrit.wikimedia.org/r/346228 (https://phabricator.wikimedia.org/T158878)
[01:27:01] greg-g: ^ Mind if I push that?
Only affects InitialiseSettings-labs
[01:29:14] (CR) Reedy: [C: 2] Disable LoginNotify on wikis that don't have Echo [mediawiki-config] - https://gerrit.wikimedia.org/r/346228 (https://phabricator.wikimedia.org/T158878) (owner: Reedy)
[01:29:38] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 79925.430035 Seconds
[01:29:48] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 79933.183203 Seconds
[01:30:23] (Merged) jenkins-bot: Disable LoginNotify on wikis that don't have Echo [mediawiki-config] - https://gerrit.wikimedia.org/r/346228 (https://phabricator.wikimedia.org/T158878) (owner: Reedy)
[01:30:37] (CR) jenkins-bot: Disable LoginNotify on wikis that don't have Echo [mediawiki-config] - https://gerrit.wikimedia.org/r/346228 (https://phabricator.wikimedia.org/T158878) (owner: Reedy)
[01:31:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 80759.079548 Seconds
[01:31:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 80760.148934 Seconds
[01:31:28] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 80761.051578 Seconds
[01:31:30] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: Disable LoginNotify on wikis that have no Echo T158878 (duration: 00m 44s)
[01:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:37] T158878: Test LoginNotify Extension on Beta Cluster - https://phabricator.wikimedia.org/T158878
[01:32:38] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 80105.167091 Seconds
[01:35:31] Reedy: expo facto don't mind :)
[01:35:53] Now prod doesn't shit itself because -labs isn't sync'd ;)
[01:36:03] *is
[01:36:12] :)
[01:39:28] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:41:48] PROBLEM - IPv6 ping to ulsfo on
ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:42:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81419.394763 Seconds
[01:43:28] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:43:28] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:46:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81659.472345 Seconds
[01:46:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:47:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:47:28] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:47:48] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:50:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 81900.30285 Seconds
[01:54:05] (CR) Aude: [C: 1] Don't set removed Wikibase client settings [mediawiki-config] - https://gerrit.wikimedia.org/r/346161 (owner: Hoo man)
[01:54:38] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 27.964003 Seconds
[01:54:38] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 27.966124 Seconds
[01:54:48] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 35.808315 Seconds
[01:55:28] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 56.283486 Seconds
[01:55:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 57.210154 Seconds
[01:55:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay
is: 58.137525 Seconds
[02:11:28] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:25:08] Operations, Office-IT, LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3152922 (mmodell) @MoritzMuehlenhoff AFAIK, Phabricator doesn't handle bounces at all and it doesn't handle SMTP envelope rejections very gracefully. Essentially phabricator keep...
[02:26:35] Operations, Office-IT, LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3152924 (mmodell) Also AFAIK, @wikimedia.org email accounts of former staff get disabled at which time they refuse delivery at the SMTP level.
[02:29:00] (Abandoned) 20after4: SemanticForms -> PageForms [mediawiki-config] - https://gerrit.wikimedia.org/r/327307 (owner: 20after4)
[02:31:40] (CR) 20after4: [C: 1] Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: Chad)
[02:34:19] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 14m 27s)
[02:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:29] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[02:39:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 4 02:39:47 UTC 2017 (duration 5m 28s)
[02:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:28] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:07:49] (CR) Krinkle: [C: 1] l10nupdate: Reduce code duplication in git clone operations [puppet] - https://gerrit.wikimedia.org/r/255958 (owner: Reedy)
[03:16:58] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:41:56] (PS1) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640)
[03:42:07] (PS2) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640)
[03:42:49] (PS3) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640)
[03:45:58] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:57:18] (PS4) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640)
[03:57:41] (PS5) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640)
[04:19:28] (CR) Krinkle: "Please do scrutinise my meagre attempt at writing Python."
[mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640) (owner: Krinkle)
[04:23:28] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:28:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:33:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:37:00] Operations, Office-IT, LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3152956 (bbogaert) Hi @MoritzMuehlenhoff , >>! In T161004#3150359, @MoritzMuehlenhoff wrote: > ... > @bbogaert: If OIT offboards a staff member, does the @wikimedia.org contin...
[04:48:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 7106.271459 Seconds
[04:48:28] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 7106.405852 Seconds
[04:48:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 7106.41168 Seconds
[04:49:28] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[04:49:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[04:49:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[04:52:28] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[05:01:00] (PS1) Dzahn: nagios_common: enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236
[05:15:09] (PS2) Dzahn: nagios_common: fi/enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085)
[05:19:01] (CR) Dzahn: [C: -1] "@Muehlenhoff this is still a little bit WIP, somehow i could not compile it earlier but error is probably unrelated. anyways, you get the " [puppet] - https://gerrit.wikimedia.org/r/346183 (owner: Dzahn)
[05:19:27] (PS3) Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085)
[05:21:17] (PS4) Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085)
[05:26:58] (PS5) Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085)
[05:33:57] (PS6) Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085)
[05:41:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3163620 keys, up 11 days 13 hours - replication_delay is 626
[05:43:28] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:44:48] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3145045 keys, up 11 days 13 hours - replication_delay is 13
[05:45:08] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3152986 (Gehel)
[05:45:17] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3152999 (Gehel) p:Triage>High
[05:52:25] (PS1) Marostegui: db-eqiad.php: Depool db1034
[mediawiki-config] - https://gerrit.wikimedia.org/r/346238 (https://phabricator.wikimedia.org/T160390)
[05:54:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 613 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3145045 keys, up 11 days 13 hours - replication_delay is 613
[05:57:12] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/346238 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[05:57:38] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:58:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/346238 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[05:58:27] (CR) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/346238 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:01:58] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3144551 keys, up 11 days 13 hours - replication_delay is 24
[06:05:48] (CR) Gehel: elasticsearch - move role::elasticsearch::common to a profile (6 comments) [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: Gehel)
[06:05:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/346239
[06:06:07] (PS11) Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[06:07:30] (CR) jerkins-bot: [V: -1] elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248
(https://phabricator.wikimedia.org/T147718) (owner: Gehel)
[06:07:46] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/346239 (owner: Marostegui)
[06:09:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/346239 (owner: Marostegui)
[06:09:13] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/346239 (owner: Marostegui)
[06:09:14] Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153014 (Steinsplitter) @ema will this be fixed soon? If not i have to fix stuff & update the MediaWiki message o...
[06:25:38] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:27:46] !log Deploy schema change db1015 (s3) - https://phabricator.wikimedia.org/T159319
[06:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:38] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:29:18] (PS1) Marostegui: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/346243 (https://phabricator.wikimedia.org/T160390)
[06:32:19] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/346243 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:33:31] (Merged) jenkins-bot: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/346243 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:33:43] (CR) jenkins-bot: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/346243 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:34:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2068 - T160390 (duration: 00m 44s)
[06:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:39] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:35:56] !log Deploy schema change db2068 (s7) - T160390
[06:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:25] (CR) Muehlenhoff: "There's no real point in making this Hiera-configurable, this is just a temporary test setup and when the tests are completed, it'll be ap" [puppet] - https://gerrit.wikimedia.org/r/346183 (owner: Dzahn)
[06:43:54] !log Deploy alter table on db2019 (codfw s4 master) - this will generate lag on codfw for s4 - T161683
[06:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:01] T161683: Remove partitioning from db2019 (codfw master) commonswiki.templatelinks - https://phabricator.wikimedia.org/T161683
[06:48:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:53:18]
RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:55:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:57:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[06:57:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[07:00:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:02:10] (CR) Thiemo Mättig (WMDE): [C: 1] Don't set removed Wikibase client settings [mediawiki-config] - https://gerrit.wikimedia.org/r/346161 (owner: Hoo man)
[07:02:38] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:07:55] Operations, Traffic, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3153092 (Nemo_bis)
[07:08:29] Operations, Traffic, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (Nemo_bis) (Fixed summary to reflect the "original" bug repor...
[07:12:48] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[07:17:05] (PS12) Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[07:18:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:20:34] (PS13) Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[07:21:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:24:54] (PS1) Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335)
[07:26:28] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:27:55] Operations, Traffic, WMF-Design, Wikimedia-General-or-Unknown, Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153103 (Nemo_bis) >>! In T76560#807460, @Nirzar wrote: > We were trying to populate this spread sheet with common errors > https://docs.google.com/a/wikim...
[07:29:15] <_joe_> mobrovac_: back in my TZ or up at ungodly hours?
[07:32:24] (PS14) Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[07:35:12] !log reimage analytics103[234] to Debian Jessie
[07:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:54] Operations, HHVM, Wikimedia-General-or-Unknown: HHVM and PCRE v8.31 gives incorrect results for certain PCRE patterns - https://phabricator.wikimedia.org/T73922#3153119 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff We're using Debian jessie for a while now (which has PCRE 8.35) an...
[07:40:48] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:45:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:48:58] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[07:50:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:52:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:54:08] !log rebooting bast2001 to Linux 4.9
[07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:38] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:57:34] Operations, Traffic: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153143 (ema)
[07:57:46] Operations, Traffic: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153132 (ema) p:Triage>Normal
[07:57:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:06:14] (PS1) Giuseppe Lavagetto: Switch master DC from eqiad to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/346251
[08:09:16] (PS1) Muehlenhoff: Fix date calculation for accounts with expiry date [puppet] - https://gerrit.wikimedia.org/r/346254
[08:13:26] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/346258
[08:13:31] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/346258
[08:15:46] (PS2) Muehlenhoff: Fix date calculation for accounts with expiry date [puppet] - https://gerrit.wikimedia.org/r/346254
[08:16:58] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:17:05] Operations, Traffic, WMF-Design,
Wikimedia-General-or-Unknown, Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153160 (Nemo_bis)
[08:17:21] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/346258 (owner: Marostegui)
[08:18:23] Operations, Analytics-Kanban, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153163 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1032.eqiad.wmnet', 'analytics1033....
[08:18:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/346258 (owner: Marostegui)
[08:18:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/346258 (owner: Marostegui)
[08:19:03] (CR) Muehlenhoff: [C: 2] Fix date calculation for accounts with expiry date [puppet] - https://gerrit.wikimedia.org/r/346254 (owner: Muehlenhoff)
[08:19:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1015 - T159319 (duration: 00m 45s)
[08:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:38] Operations, Traffic, WMF-Design, Wikimedia-General-or-Unknown, Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153164 (Nemo_bis)
[08:24:12] Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153178 (ema) >>! In T161517#3153014, @Steinsplitter wrote: > @ema will this be fixed soon? If not i have to fix...
[08:27:09] Operations, Traffic: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153180 (ema) Oh and apparently the repeated USB messages have been reported already in T148017.
[08:30:48] (CR) Nemo bis: "It's necessary to tell users and translators how they can get the errors use their language now. Will you take care of that (in particular" [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[08:34:23] Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153187 (Nemo_bis) This is the only pending question, isn't it? >>! In T161517#3140732, @Krinkle wrote: > If we...
[08:44:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:47:41] !log rebooting mw1265 to Linux 4.9
[08:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:52] (PS1) Nemo bis: Make Wikipedia link on 404 page language-agnostic via Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/346264 (https://phabricator.wikimedia.org/T113114)
[08:49:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:50:50] Operations, Analytics-Kanban, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153194 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1032.eqiad.wmnet', 'analytics1033.eqiad.wmnet', 'analytics1034.eqiad.wmnet'] ``` Of...
[08:52:53] (03CR) 10Hoo man: "Do we have any chance to know the project here, if so, you could use https://www.wikidata.org/wiki/Special:GoToLinkedPage/enwiki/Q208219" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346264 (https://phabricator.wikimedia.org/T113114) (owner: 10Nemo bis) [08:56:05] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:10:24] !log restarted swiftrepl (repl_all.sh loop) on ms-fe1005 [09:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:16:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:24:12] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:27:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:27:23] 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3153254 (10Volans) [09:29:50] 06Operations, 10media-storage: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123#3153268 (10Volans) [09:40:13] !log rebooting wtp1001 to Linux 4.9 [09:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:45:01] (03PS1) 10Hoo man: Add ll to my bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/346270 [09:49:18] PROBLEM - IPv6 ping to codfw on 
ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:53:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:54:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:58:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:14:26] 06Operations, 10Traffic: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153340 (10ema) [10:14:47] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10ema) p:05Triage>03Normal [10:15:08] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:44:08] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:45:58] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3150049 keys, up 11 days 18 hours - replication_delay is 621 [10:46:42] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3153376 (10elukey) >>! In T125735#3152656, @aaron wrote: > In $wmgRedisQueueBaseConfig in wmf-config/jobqueu... 
[10:50:11] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10jcrespo) [10:55:38] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153405 (10jcrespo) Origin ips (under NDA): {P5199} The queries done are: ``` ?format=json&action=parse&page=[*title*]&prop=tex... [10:58:08] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3137604 keys, up 11 days 18 hours - replication_delay is 0 [11:01:15] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10TheDJ) I don't think that the purge was complete. This one h... 
[11:02:38] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 [11:02:48] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 [11:02:58] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 [11:02:58] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 [11:02:58] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 [11:03:08] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 [11:03:18] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 [11:03:23] that's me, looking ^ [11:03:28] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 [11:05:28] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [11:06:25] cp3003 did reboot into 4.9 but eth0 is marked as down [11:16:05] ACKNOWLEDGEMENT - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% Ema eth0 issues upon reboot into 4.9, host depooled [11:30:25] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3146429 (10akosiaris) Something is rotten in the state of maps-cleartables ``` akosiaris@maps-clea... 
[11:31:17] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346277 [11:31:22] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346277 [11:36:44] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346277 (owner: 10Marostegui) [11:37:57] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346277 (owner: 10Marostegui) [11:38:06] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346277 (owner: 10Marostegui) [11:39:11] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2068 - T160390 (duration: 00m 58s) [11:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:20] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [11:39:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346278 (https://phabricator.wikimedia.org/T160390) [11:39:54] (03PS3) 10Muehlenhoff: Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [11:42:48] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:44:06] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346278 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [11:45:24] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346278 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [11:46:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2061 - T160390 (duration: 00m 44s) [11:46:23] !log Deploy schema change db2061 (s7) - T160390 [11:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [11:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:37] (03CR) 10jenkins-bot: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346278 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [11:50:37] 06Operations, 10ops-esams, 10Traffic: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153465 (10ema) [11:50:45] 06Operations, 10ops-esams, 10Traffic: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153480 (10ema) p:05Triage>03Normal [11:51:25] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3153481 (10Gehel) @akosiaris Thanks for looking into this! How did I end up with puppet isntalled a... 
[11:52:30] ACKNOWLEDGEMENT - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:30] ACKNOWLEDGEMENT - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:52:31] ACKNOWLEDGEMENT - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3003_v4, cp3003_v6 Ema https://phabricator.wikimedia.org/T162132 [11:53:31] (03PS1) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [11:53:51] !log reimage analytics10[36,37,38] to Debian Jessie [11:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:20] (03CR) 10Marostegui: "@jcrespo, any objection to merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [11:58:15] (03CR) 10Jcrespo: [C: 031] "We also need to deploy the prometheus mysql exporter deletion too." 
[puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [11:58:44] !log installing e2fsprogs update from jessie point update [11:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:05] (03PS4) 10Marostegui: site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) [12:01:26] (03CR) 10Marostegui: "> We also need to deploy the prometheus mysql exporter deletion too." [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [12:01:46] (03PS5) 10Marostegui: site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) [12:08:20] (03CR) 10Marostegui: [C: 032] site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [12:08:26] (03PS1) 10ArielGlenn: add table type to flagged revs table config file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/346280 [12:09:16] (03PS2) 10ArielGlenn: add table type to flagged revs table config file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/346280 [12:10:23] (03CR) 10ArielGlenn: [C: 032] add table type to flagged revs table config file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/346280 (owner: 10ArielGlenn) [12:10:48] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:13:31] 06Operations, 10ops-eqiad, 10DBA: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3153532 (10Marostegui) [12:13:41] 06Operations, 10ops-eqiad, 10DBA: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3153550 (10Marostegui) p:05Triage>03Normal [12:16:57] 06Operations, 10ops-esams, 10Traffic: cp3003 network interface 
issues - https://phabricator.wikimedia.org/T162132#3153567 (10ema) I've tried a "cold reboot" with `racadm serveraction powerdown ; racadm serveraction powerup` to no avail. [12:19:18] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153569 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1036.eqiad.wmnet', 'analytics1037.... [12:19:49] !log upgrade cp2003 to linux 4.9 T162029 [12:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:57] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [12:23:18] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:59] yay cp2003 made it! :) [12:46:25] Reedy: have you got a minute to answer a few questions for me? [12:46:36] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [12:46:44] (03PS8) 10Alexandros Kosiaris: url_downloader: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [12:46:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] url_downloader: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [12:48:34] anyone know where shinken-wm irc bot code can be found (what repo) and where could i find the same for icinga-wm ? [12:51:14] hashar: nothing for eu swat today [12:51:54] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3153620 (10Gehel) 05Open>03Resolved a:03Gehel Some other trouble, but this specific issue is... 
[12:52:18] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T1300). [13:00:14] !log cache_upload: ban all objects with content-type ~ "^text" T162035 [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:22] T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 [13:00:53] (03PS1) 10Addshore: Enable interwikisorting on BETA wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346283 [13:01:40] (03PS4) 10Alexandros Kosiaris: Revert "Revert "Add the LVS blocks to url_downloader"" [puppet] - 10https://gerrit.wikimedia.org/r/207490 [13:02:30] o/ (I just added 1 patch to swat) [13:03:03] addshore: question, in regex is it * or . for any char [13:03:27] . [13:03:51] so for example addshore foo. [13:04:53] (03CR) 10Addshore: [C: 032] Enable interwikisorting on BETA wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346283 (owner: 10Addshore) [13:05:07] !log installing ca-certificates updates from jessie point update [13:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:14] (03Draft2) 10Zppix: Adding a few more typos that could break things if they aren't tested for. 
[puppet] - 10https://gerrit.wikimedia.org/r/346282 [13:05:59] (03Merged) 10jenkins-bot: Enable interwikisorting on BETA wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346283 (owner: 10Addshore) [13:06:25] (03CR) 10jenkins-bot: Enable interwikisorting on BETA wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346283 (owner: 10Addshore) [13:06:42] (03PS2) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [13:08:28] (03PS3) 10Zppix: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:08:55] Zppix: It's generally advised not to touch other people's patches when they're actively working on them [13:09:22] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6014/aluminium.wikimedia.org/ says it's fine. So 2 years after the first submit this is finally resubm" [puppet] - 10https://gerrit.wikimedia.org/r/207490 (owner: 10Alexandros Kosiaris) [13:09:50] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY [[gerrit:346283|Enable interwikisorting on BETA wiktionaries]] (duration: 00m 44s) [13:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:56] Reedy: ack [13:09:59] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153643 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1038.eqiad.wmnet'] ``` The log can... [13:10:05] afaik thats swat doen then...! [13:10:07] *done [13:11:04] !log add LVS IPs to the url-downloader blacklist now that all nodejs services no longer require it anymore. 
See https://gerrit.wikimedia.org/r/207490 [13:11:07] does anyone know where the icinga-wm and shinken-wm repos are for the irc codes [13:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:17] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:12:37] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:12:37] PROBLEM - restbase endpoints health on cerium is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:12:37] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:12:57] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:12:57] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:04] Zppix: It's just ircecho it seems https://github.com/wikimedia/puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/shinken/manifests/ircbot.pp#L12 [13:13:07] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:07] 
PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:07] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:17] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:17] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:17] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:27] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:27] PROBLEM - restbase endpoints health on xenon is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:27] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:27] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: 
/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:27] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:28] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:28] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:13:29] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:15:01] akosiaris: ^ Is that your change? [13:15:27] Reedy: that would be funny [13:15:32] but maybe [13:15:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:16:18] Just with restbase using node stuff.. 
[13:16:18] I'll revert just for good measure and let's see later [13:16:40] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Revert "Add the LVS blocks to url_downloader""" [puppet] - 10https://gerrit.wikimedia.org/r/346285 [13:16:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Revert "Revert "Add the LVS blocks to url_downloader""" [puppet] - 10https://gerrit.wikimedia.org/r/346285 (owner: 10Alexandros Kosiaris) [13:18:17] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:18:17] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:18:17] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [13:18:17] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [13:18:18] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:18:18] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:18:27] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [13:18:27] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:18:27] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:18:27] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:18:27] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:18:31] rofl [13:18:33] Reedy: yeah definitely [13:18:37] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:18:37] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [13:18:37] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:18:37] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [13:18:46] so why on earth does restbase use url-downloader? 
[13:18:47] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:18:48] mobrovac_: ^ ? [13:18:57] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [13:19:07] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [13:19:07] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:19:09] "Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200)" [13:19:21] That suggests it uses en.wikipedia.org [13:19:29] Rather than the service pool, sending a host header etc [13:20:18] maybe [13:20:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:21:23] Reedy: it's fine connecting to the LVS IP. It's just that it should do it directly, not via url-downloader [13:21:29] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Revert "Revert "Add the LVS blocks to url_downloader"""" [puppet] - 10https://gerrit.wikimedia.org/r/346287 [13:21:53] 4 reverts already... let's hope it will not take another 2 years for restbase to stop using url-downloader [13:23:59] Reedy: actually by the citation part I guess that's citoid ? [13:24:15] oh please tell me it's not restbase calling citoid calling restbase to provide a citation [13:24:26] 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3153254 (10faidon) You can kill the two thumbs if you want to move past it, as killing thumbs is almost always a safe operation. That said, there is probably an underlying bug that resu... [13:24:28] Possibly. 
I really don't know that much about the services stuff [13:24:55] akosiaris: We've seen worse ideas around here [13:25:10] true [13:25:19] ah the stories we have to tell [13:26:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:28:17] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153673 (10jcrespo) Seems to have stopped for now since 12:34 UTC: https://grafana.wikimedia.org/dashboard/db/api-summary?panelId... [13:31:13] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates] [13:31:33] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:31:51] ema: --^ [13:32:08] elukey: yeah thanks :) [13:32:16] seems to be over already [13:32:17] :) [13:32:26] (see #-traffic) [13:32:43] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:35:40] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153680 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1038.eqiad.wmnet'] ``` and were **ALL** successful. 
[13:38:43] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:39:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:47:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:48:20] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153759 (10jcrespo) 05Open>03stalled [13:48:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:48:23] PROBLEM - Check systemd state on analytics1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:49:03] PROBLEM - Check systemd state on analytics1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:52:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:53:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:54:41] checking 1037 and 1036 (just reimaged) [13:58:23] RECOVERY - Check systemd state on analytics1037 is OK: OK - running: The system is fully operational [13:58:39] weird, puppet.service was failed for SIGTERM [13:59:03] RECOVERY - Check systemd state on analytics1036 is OK: OK - running: The system is fully operational [13:59:13] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:04:35] 06Operations, 10DBA, 06Labs: eqiad: (2) hardware access request for labsdb1004 & 5 refresh - https://phabricator.wikimedia.org/T161754#3153870 (10chasemp) [14:06:05] !log reimage analytics1039 and 1051 to Debian Jessie [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] (03CR) 10Andrew Bogott: [C: 031] nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - 10https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085) (owner: 10Dzahn) [14:06:33] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 578044 [14:10:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:14:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:15:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:18:23] PROBLEM - Check Varnish expiry mailbox lag on 
cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 594554 [14:19:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:21:34] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153884 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1039.eqiad.wmnet', 'analytics1051.... [14:22:25] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153887 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1039.eqiad.wmnet', 'analytics1051.... [14:26:33] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 470 [14:27:34] !log rebooting cerium to Linux 4.9 [14:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:26] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 54006 [14:34:16] !log rebooting xenon to Linux 4.9 [14:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3153926 (10Andrew) I shutdown a failed instance and mounted the drive. ``` andrew@labvirt1008:/tmp/mnt/var/lib/dhcp$ ls dhclient.eth0.leases dhclient.... 
[14:38:07] 06Operations, 10ops-esams, 10netops: esams higher than usual temperature - https://phabricator.wikimedia.org/T162152#3153928 (10faidon) [14:39:32] !log rebooting praseodymium to Linux 4.9 [14:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] moritzm hi, will linux 4.9 be available for labs too? [14:46:02] paladox: it's already available via the "linux-meta-4.9" package on apt.wikimedia.org, will talk to labs team to use it by default for jessie images soon [14:46:14] ok thanks [14:46:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:47:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:50:29] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3153961 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1039.eqiad.wmnet', 'analytics1051.eqiad.wmnet'] ``` Of which those **FAILED**: ```... 
[14:51:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:52:23] (03PS1) 10Volans: PuppetDB backend: consistently use InvalidQueryError [software/cumin] - 10https://gerrit.wikimedia.org/r/346301 (https://phabricator.wikimedia.org/T162151) [14:52:25] (03PS1) 10Volans: PuppetDB backend: forbid resource's parameters regex [software/cumin] - 10https://gerrit.wikimedia.org/r/346302 (https://phabricator.wikimedia.org/T162151) [14:53:01] (03PS1) 10Giuseppe Lavagetto: mediawiki: use mw_primary for jobrunner, cronjobs state [puppet] - 10https://gerrit.wikimedia.org/r/346303 [14:53:50] moritzm i just installed linux 4.9 and i get this [14:53:52] groups: cannot find name for group ID 50062 [14:53:52] groups: cannot find name for group ID 50380 [14:53:52] groups: cannot find name for group ID 51275 [14:53:52] groups: cannot find name for group ID 52308 [14:53:53] groups: cannot find name for group ID 53013 [14:53:54] groups: cannot find name for group ID 53259 [14:53:55] now [14:53:59] I didnt see that before with 4.4 [14:58:35] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/6017/ seems ok" [puppet] - 10https://gerrit.wikimedia.org/r/346303 (owner: 10Giuseppe Lavagetto) [14:59:02] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3153973 (10Andrew) Until I can get install1001 to STOP responding to labs dhcp requests, I'm going to assume that that's the problem. 
[15:02:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:02:26] (03PS1) 10Ema: cache_upload: properly detect 304s when unsetting CT [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [15:03:46] (03PS4) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [15:05:07] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: Puppet has 10 failures [15:07:48] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:08:14] (03PS5) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [15:10:07] RECOVERY - check_puppetrun on pay-lvs2001 is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures [15:10:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:32] (03PS1) 10Giuseppe Lavagetto: Add tasks for stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346305 [15:11:34] (03PS1) 10Giuseppe Lavagetto: Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 [15:11:36] (03PS1) 10Giuseppe Lavagetto: Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 [15:11:38] (03PS1) 10Giuseppe Lavagetto: Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 [15:11:40] (03PS1) 10Giuseppe Lavagetto: Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 [15:11:42] (03PS1) 10Giuseppe Lavagetto: Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310 [15:11:44] (03PS1) 10Giuseppe Lavagetto: Add 
task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 [15:12:29] (03CR) 10Volans: "Puppet compiler results: https://puppet-compiler.wmflabs.org/6019/" [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [15:13:18] (03CR) 10jerkins-bot: [V: 04-1] Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto) [15:13:20] (03CR) 10jerkins-bot: [V: 04-1] Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 (owner: 10Giuseppe Lavagetto) [15:13:22] (03CR) 10jerkins-bot: [V: 04-1] Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 (owner: 10Giuseppe Lavagetto) [15:13:30] (03CR) 10jerkins-bot: [V: 04-1] Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 (owner: 10Giuseppe Lavagetto) [15:13:32] (03CR) 10jerkins-bot: [V: 04-1] Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310 (owner: 10Giuseppe Lavagetto) [15:13:52] (03CR) 10jerkins-bot: [V: 04-1] Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [15:15:07] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [15:16:21] 06Operations, 10DBA, 06Labs: eqiad: (2) hardware access request for labsdb1004 & 5 refresh - https://phabricator.wikimedia.org/T161754#3154025 (10chasemp) 05Open>03stalled [15:16:32] 06Operations, 10DBA, 06Labs: eqiad: (2) hardware access request for labsdb1006 & 7 refresh - https://phabricator.wikimedia.org/T161755#3154026 (10chasemp) 05Open>03stalled [15:29:16] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - 
https://phabricator.wikimedia.org/T160908#3154082 (10Andrew) Ok, I no longer thing that install1001 is involved. Instead, it's something to do with an IP getting assigned to two instances at onc... [15:29:27] 06Operations, 10ops-codfw, 10DBA: codfw racking first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Marostegui) [15:33:14] (03PS2) 10Giuseppe Lavagetto: Add tasks for stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346305 [15:33:16] (03PS2) 10Giuseppe Lavagetto: Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 [15:33:18] (03PS2) 10Giuseppe Lavagetto: Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 [15:33:19] (03PS2) 10Giuseppe Lavagetto: Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 [15:33:22] (03PS2) 10Giuseppe Lavagetto: Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 [15:33:24] (03PS2) 10Giuseppe Lavagetto: Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310 [15:33:26] (03PS2) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 [15:35:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:50] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:38:50] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:40:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:42:10] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:43:50] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:45:37] moritzm im noticing performance improvements with linux 4.9. Unless it's because i rebooted. But ssh in is faster and running commands are faster. [15:47:41] (03PS1) 10Jcrespo: mariadb: Depool db1034 temporarilly to run ALTER TABLE on revision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346313 (https://phabricator.wikimedia.org/T159319) [15:48:08] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1034 temporarilly to run ALTER TABLE on revision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346313 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [15:49:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:50:01] (03CR) 10Jcrespo: [C: 04-1] "This fixes puppet, but I do not think it makes it work." [puppet] - 10https://gerrit.wikimedia.org/r/345847 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [15:50:59] (03CR) 10Volans: "A minor comment inline." 
(031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto) [15:51:01] (03PS2) 10Jcrespo: mariadb: Depool db1034 temporarilly to run ANALYZE on revision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346313 (https://phabricator.wikimedia.org/T159319) [15:54:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:54:26] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1034 temporarilly to run ANALYZE on revision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346313 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [15:54:42] (03CR) 10jenkins-bot: mariadb: Depool db1034 temporarilly to run ANALYZE on revision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346313 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [15:56:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 for maintenance (duration: 00m 44s) [15:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:30] (03CR) 10Volans: "A couple of minor comments inline." 
(033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto) [15:57:25] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346307 (owner: 10Giuseppe Lavagetto) [15:58:36] (03CR) 10Volans: [C: 031] "LGTM, but depends on the final choice of the switch procedure for traffic" [switchdc] - 10https://gerrit.wikimedia.org/r/346308 (owner: 10Giuseppe Lavagetto) [15:59:11] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346309 (owner: 10Giuseppe Lavagetto) [15:59:20] !log running ANALIZE on revision table for on eswiki,cawiki on db1034 [15:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:51] !log reimage analytics1052 (Hadoop Journal node) to Debian Jessie [15:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T1600). 
[16:00:16] 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154178 (10Papaul) [16:00:38] 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Papaul) p:05Triage>03Normal a:03Papaul [16:03:25] (03CR) 10Volans: [C: 031] "LGTM but might depend on the switch procedure for traffic" [switchdc] - 10https://gerrit.wikimedia.org/r/346310 (owner: 10Giuseppe Lavagetto) [16:03:32] !log Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds [16:03:38] sjoerddebruin: FYI^ [16:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:39] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [16:05:57] (03CR) 10Volans: "A couple of comments inline" (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [16:06:02] no patches scheduled from what I can see - following godog's best practices: https://giphy.com/gifs/funny-happy-excited-gTNSX6N7vcKOY [16:07:10] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:11:10] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:11:45] (03CR) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts (0316 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346118 (owner: 10Giuseppe Lavagetto) [16:12:58] (03PS4) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346118 [16:20:37] (03PS5) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346118 [16:21:31] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3154217 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1052.eqiad.wmnet'] ``` The log can b... [16:26:17] (03PS2) 10Ema: cache_upload: properly detect 304s when unsetting CT [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [16:28:15] (03PS3) 10Ema: cache_upload: properly detect 304s when unsetting CT [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [16:29:32] (03PS1) 10Andrew Bogott: Nova dnsmasq: Reduce lease times and ttls by a lot [puppet] - 10https://gerrit.wikimedia.org/r/346318 (https://phabricator.wikimedia.org/T160908) [16:31:06] (03PS4) 10Ema: cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [16:36:10] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:40:11] (03PS5) 10Ema: cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [16:45:03] (03CR) 10Subramanya Sastry: 
"https://github.com/wikimedia/mediawiki-services-parsoid-testreduce/commit/a76785d3cc77b58d3d5f3062af6ba3c4748dc1f1 now fixes testreduce to" [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [16:46:11] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 [16:46:22] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3154270 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1052.eqiad.wmnet'] ``` and were **ALL** successful. [16:46:46] (03CR) 10Jcrespo: [C: 04-2] "Not yet, until query finishes and replication lag recovers." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo) [16:46:52] (03CR) 10Giuseppe Lavagetto: [C: 031] "a couple of smallish comments but LGTM. It can even be merged as-is." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:49:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=448.00 Read Requests/Sec=505.80 Write Requests/Sec=0.80 KBytes Read/Sec=36488.40 KBytes_Written/Sec=17.20 [16:54:02] (03PS1) 10Giuseppe Lavagetto: cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 [16:54:04] (03PS1) 10Giuseppe Lavagetto: discovery::app_routes: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346321 [16:54:06] (03PS1) 10Giuseppe Lavagetto: cache::text: remove direct route to mediawiki from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/346322 [16:56:20] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [16:57:20] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 
has 1 databases (db0) with 3130118 keys, up 12 days 42 minutes - replication_delay is 0 [16:59:11] (03CR) 10Volans: "Done" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:59:23] (03PS6) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T1700). Please do the needful. [17:00:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=24.10 Read Requests/Sec=0.30 Write Requests/Sec=0.30 KBytes Read/Sec=1.20 KBytes_Written/Sec=17.60 [17:03:28] we might have a parsoid deploy later on once arlo is back, but if others are deploying, please go ahead. [17:15:07] RECOVERY - check_swap on lutetium is OK: SWAP OK - 100% free (7608 MB out of 7627 MB) [17:15:40] (03CR) 10Volans: [C: 031] "LGTM, single nitpick comment inline (on a comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346118 (owner: 10Giuseppe Lavagetto) [17:23:06] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) [17:24:11] (03PS7) 10Volans: Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) [17:24:55] Hi all, I'd like to ask everybody why today isn't Morning SWAT. Is it less frequent than other SWAT windows? [17:25:04] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [17:25:25] _joe_: Great! [17:26:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "on hold until the switchover is completed." 
[puppet] - 10https://gerrit.wikimedia.org/r/346173 (owner: 10Giuseppe Lavagetto) [17:27:24] (03CR) 10Volans: [C: 032] Switchdc: add profile to install and configure it [puppet] - 10https://gerrit.wikimedia.org/r/346279 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:28:23] (03PS6) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346118 [17:28:29] (03PS1) 10Chad: Scap clean: l10nupdate cache is owned by www-data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346325 [17:32:25] (03CR) 10Thcipriani: [C: 031] Scap clean: l10nupdate cache is owned by www-data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346325 (owner: 10Chad) [17:33:41] (03CR) 10Chad: [C: 032] Scap clean: l10nupdate cache is owned by www-data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346325 (owner: 10Chad) [17:34:49] (03Merged) 10jenkins-bot: Scap clean: l10nupdate cache is owned by www-data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346325 (owner: 10Chad) [17:36:29] (03CR) 10jenkins-bot: Scap clean: l10nupdate cache is owned by www-data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346325 (owner: 10Chad) [17:36:44] (03CR) 10Giuseppe Lavagetto: [C: 032] base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346118 (owner: 10Giuseppe Lavagetto) [17:37:35] (03PS7) 10Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - 10https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085) [17:38:35] (03PS1) 10Giuseppe Lavagetto: Revert "base::puppet: add puppet helper scripts" [puppet] - 10https://gerrit.wikimedia.org/r/346326 [17:38:42] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "base::puppet: add puppet helper scripts" [puppet] - 10https://gerrit.wikimedia.org/r/346326 (owner: 10Giuseppe Lavagetto) [17:38:42] paladox: fyi https://phabricator.wikimedia.org/T162029 [17:38:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert 
"base::puppet: add puppet helper scripts" [puppet] - 10https://gerrit.wikimedia.org/r/346326 (owner: 10Giuseppe Lavagetto) [17:38:56] <_joe_> grrr [17:38:56] because i saw you requesting 4.9 kernel [17:39:20] mutante thanks yep i am subscribed to that. I installed it on gerrit-test, gerrit-test3, jenkins-slave-01, phabricator. [17:39:30] paladox: cool! ok [17:39:34] <_joe_> I can't start to describe the WTF I just found :P [17:39:35] yep [17:39:47] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:39:57] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:07] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:17] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:17] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:17] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:17] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:17] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:27] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:27] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:27] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:40:27] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:27] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:28] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:36] _joe_: related to your change? or mine? [17:40:37] (03PS8) 10Dzahn: nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - 10https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085) [17:40:37] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:37] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:38] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:38] PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:59] !log stopped ircecho to avoid IRC spam [17:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:09] <_joe_> volans: mine, and you'll love it [17:41:15] there is a script with the same name? [17:41:24] <_joe_> no [17:41:26] <_joe_> Error: Failed to apply catalog: Cannot alias File[/usr/local/sbin/] to ["/usr/local/sbin"] at /etc/puppet/modules/base/manifests/puppet.pp:116; resource ["File", "/usr/local/sbin"] already declared at /etc/puppet/modules/profile/manifests/base.pp:25 [17:41:42] <_joe_> so that dir is defined twice [17:41:50] lovely! 
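[note] The "Cannot alias File[/usr/local/sbin/] to ["/usr/local/sbin"]" error quoted above can be illustrated with a small model. This is only a sketch, not Puppet's actual code: it assumes the catalog indexes each file resource both by its literal title and by its path with trailing slashes stripped, and that a collision on the stripped path is what produces the error. The names Catalog, add_file, normalize_path and DuplicateResourceError are hypothetical and exist only for this illustration.

```python
class DuplicateResourceError(Exception):
    """Raised when two declarations resolve to the same resource key."""


def normalize_path(path):
    # Assumption: trailing slashes are ignored when comparing file paths,
    # so "/usr/local/sbin/" and "/usr/local/sbin" name the same directory.
    stripped = path.rstrip("/")
    return stripped or "/"


class Catalog:
    """Toy catalog: maps (type, key) to the place where it was declared."""

    def __init__(self):
        self._declared_at = {}

    def add_file(self, title, declared_at):
        title_key = ("File", title)
        alias_key = ("File", normalize_path(title))

        # Same literal title declared twice.
        if title_key in self._declared_at:
            raise DuplicateResourceError(
                f"File[{title}] at {declared_at} already declared at "
                f"{self._declared_at[title_key]}"
            )

        # Different title, but the normalized path is already taken.
        if alias_key != title_key and alias_key in self._declared_at:
            raise DuplicateResourceError(
                f"Cannot alias File[{title}] to [{alias_key[1]!r}] at "
                f"{declared_at}; resource already declared at "
                f"{self._declared_at[alias_key]}"
            )

        self._declared_at[title_key] = declared_at
        self._declared_at[alias_key] = declared_at


if __name__ == "__main__":
    catalog = Catalog()
    # File locations taken from the error message in the log.
    catalog.add_file("/usr/local/sbin", "profile/manifests/base.pp:25")
    try:
        catalog.add_file("/usr/local/sbin/", "base/manifests/puppet.pp:116")
    except DuplicateResourceError as err:
        print(f"Error: Failed to apply catalog: {err}")
```

In this model the two declarations differ only by a trailing slash in the title, which is enough to make them collide on the normalized path even though neither literal title was declared twice.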
[17:41:52] <_joe_> until the declaration was the same, it didn't fail [17:42:01] <_joe_> now I added a second file to the same define [17:42:06] <_joe_> so formally I changed nothing [17:42:09] <_joe_> and it fails [17:42:14] <_joe_> how awesome is that? [17:42:16] but you changed the "title" [17:42:24] awesome puppet [17:42:43] (03CR) 10Dzahn: [C: 032] nagios_common: fix/enhance check_ssl_certfile plugin [puppet] - 10https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085) (owner: 10Dzahn) [17:43:36] <_joe_> anyways, fixing it [17:45:24] (03PS1) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346328 [17:47:12] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Legoktm) > Requests do not have a user agent There's no user-agent header at all or is it some generic UA? [17:48:51] (03PS2) 10Dzahn: Add ll to my bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/346270 (owner: 10Hoo man) [17:49:36] (03PS3) 10Dzahn: admins::hoo: Add ll to bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/346270 (owner: 10Hoo man) [17:49:46] (03CR) 10Dzahn: [C: 032] admins::hoo: Add ll to bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/346270 (owner: 10Hoo man) [17:50:05] (03CR) 10Dzahn: [V: 032 C: 032] admins::hoo: Add ll to bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/346270 (owner: 10Hoo man) [17:50:26] (03PS1) 10Chad: scap clean: Only prune staging files from the active master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346330 (https://phabricator.wikimedia.org/T161643) [17:53:23] !log disabling puppet on labvirts to roll out a nova config change [17:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:00] (03PS2) 10Andrew Bogott: Nova dnsmasq: Reduce lease times and ttls by a lot [puppet] 
- 10https://gerrit.wikimedia.org/r/346318 (https://phabricator.wikimedia.org/T160908) [17:55:37] (03CR) 10Thcipriani: [C: 031] scap clean: Only prune staging files from the active master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346330 (https://phabricator.wikimedia.org/T161643) (owner: 10Chad) [17:56:06] (03CR) 10Andrew Bogott: [C: 032] Nova dnsmasq: Reduce lease times and ttls by a lot [puppet] - 10https://gerrit.wikimedia.org/r/346318 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [17:57:08] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic, 05Security: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154452 (10MaxSem) [17:57:27] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:58:16] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10MaxSem) [17:59:45] (03CR) 10Chad: [C: 032] scap clean: Only prune staging files from the active master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346330 (https://phabricator.wikimedia.org/T161643) (owner: 10Chad) [18:02:22] (03Merged) 10jenkins-bot: scap clean: Only prune staging files from the active master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346330 (https://phabricator.wikimedia.org/T161643) (owner: 10Chad) [18:02:31] (03CR) 10jenkins-bot: scap clean: Only prune staging files from the active master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346330 (https://phabricator.wikimedia.org/T161643) (owner: 10Chad) [18:05:05] (03CR) 10BBlack: cache_upload: override CT updates on 304s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [18:07:17] (03CR) 10Dzahn: "works, it turned the Icinga checks 
to CRIT now as it should: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm" [puppet] - 10https://gerrit.wikimedia.org/r/346236 (https://phabricator.wikimedia.org/T162085) (owner: 10Dzahn) [18:07:40] ircecho was me, the recovery of all the previous failure are coming [18:09:07] (03PS1) 10Chad: Scap clean: Shut up non-error output from git ops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346331 [18:09:18] (03CR) 10Chad: [C: 032] Scap clean: Shut up non-error output from git ops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346331 (owner: 10Chad) [18:10:35] (03Merged) 10jenkins-bot: Scap clean: Shut up non-error output from git ops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346331 (owner: 10Chad) [18:10:47] (03CR) 10jenkins-bot: Scap clean: Shut up non-error output from git ops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346331 (owner: 10Chad) [18:12:03] 06Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring, 13Patch-For-Review: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#3154475 (10Dzahn) after the merge above now Icinga checks turned CRIT as they should have. due to a bug they stayed just WARN before for l... [18:13:07] (03CR) 10Dzahn: ":) thanks Alex" [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [18:15:06] all recovered, restarting ircecho [18:15:34] new alerts about expiring certs will show up, but that's the good part because they should have shown earlier [18:16:15] !log demon@tin Started scap: wmf.19 bootstrap [18:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:24] (03CR) 10Thcipriani: "One minor issue and one nit about deployment-prep to make sure salt isn't trying to manage the same repo as scap on the deployment boxen." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [18:20:32] (03PS1) 10Volans: Fix typo for dict access [switchdc] - 10https://gerrit.wikimedia.org/r/346332 (https://phabricator.wikimedia.org/T160178) [18:22:48] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 55963.01943 Seconds [18:23:27] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 56002.408584 Seconds [18:23:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 56013.677784 Seconds [18:24:27] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [18:24:37] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [18:24:47] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [18:25:05] gehel: FYI ^^^ [18:25:22] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3154500 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1053.eqiad.wmnet', 'analytics1054.eq... [18:25:37] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:26:16] volans: thanks! [18:29:25] yw :) [18:29:59] (03PS2) 10Jforrester: Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) [18:30:01] (03PS1) 10Jforrester: Enable wgCiteResponsiveReferences on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346333 (https://phabricator.wikimedia.org/T162145) [18:33:47] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:36:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix typo for dict access [switchdc] - 10https://gerrit.wikimedia.org/r/346332 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [18:36:46] _joe_: I was assuming you want to first merge yours and this after to avoid all the rebasing ;) [18:37:18] <_joe_> nah, watever [18:37:37] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:05] (03CR) 10Giuseppe Lavagetto: [C: 032] base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346328 (owner: 10Giuseppe Lavagetto) [18:38:12] (03PS2) 10Giuseppe Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346328 [18:38:17] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346328 (owner: 10Giuseppe Lavagetto) [18:42:48] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:44:47] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:48:00] (03PS1) 10Giuseppe Lavagetto: base::puppet: actually install run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/346337 [18:48:29] (03CR) 10Giuseppe Lavagetto: [C: 032] base::puppet: actually install run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/346337 (owner: 10Giuseppe Lavagetto) [18:48:57] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] base::puppet: actually install run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/346337 (owner: 10Giuseppe Lavagetto) [18:50:19] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3154665 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1053.eqiad.wmnet', 'analytics1054.eqiad.wmnet'] ``` and were **ALL** successful. [18:51:14] 06Operations, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) [18:51:23] 06Operations, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154678 (10Dzahn) a:03Dzahn [18:51:31] !log demon@tin Finished scap: wmf.19 bootstrap (duration: 35m 16s) [18:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:43] (03PS1) 10Giuseppe Lavagetto: base::puppet: add include to run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/346338 [18:52:23] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154681 (10jcrespo) User agent was "-" (without quotes). 
[18:52:32] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] base::puppet: add include to run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/346338 (owner: 10Giuseppe Lavagetto) [18:54:57] 06Operations, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154688 (10Dzahn) [18:55:00] (03PS1) 10Chad: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346339 [18:55:54] !log demon@tin Synchronized php: symlink repoint (duration: 00m 39s) [18:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:26] 06Operations, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) p:05Triage>03Normal [18:57:23] 06Operations, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154714 (10Dzahn) [18:57:27] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10MaxSem) We used to block API requests that provided no UA - anybody remembers why did we stop doing that? [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T1900). [19:02:40] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [19:03:21] (03PS2) 10Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) [19:03:56] (03CR) 10Mobrovac: RESTBase: Migrate to Scap3 deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [19:04:50] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:05:31] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:06:21] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:08:31] (03CR) 10Chad: [C: 032] group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346339 (owner: 10Chad) [19:10:12] 06Operations, 10RESTBase, 10RESTBase-Cassandra: cassandra client authentication - https://phabricator.wikimedia.org/T112742#3154775 (10Volker_E) [19:10:35] (03Merged) 10jenkins-bot: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346339 (owner: 10Chad) [19:10:53] 06Operations, 13Patch-For-Review: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1954131 (10Dzahn) came here looking to do this for a similar issue. would have been nice to see the actual command that was the solution here. 
[19:10:57] (03CR) 10jenkins-bot: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346339 (owner: 10Chad) [19:11:50] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:16:38] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3154837 (10Cmjohnson) Set the raid cfg to raid 10 [19:16:53] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3154838 (10Cmjohnson) Set the raid cfg to raid 10 [19:17:46] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.19 [19:17:50] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:02] 06Operations, 10ops-eqiad: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3022054 (10Cmjohnson) @arielglenn, clean up everything but dns and update task. I will wipe it and remove dns once off the rack. [19:22:50] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:26:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:27:50] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 59863.120938 Seconds [19:28:50] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [19:29:13] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Tgr) >>! 
In T162129#3154681, @jcrespo wrote: > User agent was "-" (without quotes). More likely, nothing at all. The... [19:32:50] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:33:52] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154911 (10Tgr) Did the IPs change periodically or did they actually use 50 boxes to query the API in parallel? The second case s... [19:47:00] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:49:29] (03CR) 10Thcipriani: [C: 031] RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [19:53:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:54:05] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154951 (10Tgr) Seems to have restarted (at least based on raw GET volume, haven't looked at what type it is). See P5199#27747 f... 
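Note on the flapping "IPv6 ping" alerts above: the PROBLEM and RECOVERY messages differ by a single probe (20 failed vs. 19 failed, "alerts on 19"), which suggests the check sits right at its threshold. A minimal sketch of that comparison follows, assuming the check goes CRITICAL when the failed count exceeds the "alerts on" value. That rule is inferred from the messages in this log, not taken from the actual check script, and the function name `probe_state` is made up for illustration.

```python
import re

# Message shape as it appears in this log, e.g.
# "CRITICAL - failed 20 probes of 279 (alerts on 19)"
PROBE_RE = re.compile(r"failed (\d+) probes of (\d+) \(alerts on (\d+)\)")


def probe_state(message: str) -> str:
    """Classify a RIPE Atlas alert message as 'CRITICAL' or 'OK'.

    Assumption (inferred from the log, not from the real check):
    the state is CRITICAL when failed > threshold.
    """
    match = PROBE_RE.search(message)
    if match is None:
        raise ValueError(f"unrecognised probe message: {message!r}")
    failed, _total, threshold = (int(g) for g in match.groups())
    return "CRITICAL" if failed > threshold else "OK"


if __name__ == "__main__":
    for msg in (
        "CRITICAL - failed 20 probes of 279 (alerts on 19)",
        "OK - failed 19 probes of 279 (alerts on 19)",
    ):
        print(probe_state(msg), "<-", msg)
```

Under that assumption a single probe flipping between reachable and unreachable is enough to toggle the alert, which would match the repeated PROBLEM/RECOVERY pairs five minutes apart.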
[19:54:50] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:59:50] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:00:20] !log rolling out a border-in4 ACL update across core routers (T160055) [20:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:28] T160055: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055 [20:00:31] (the ulsfo alert was unrelated, not sure what's up with that) [20:02:35] (03PS1) 10Dzahn: renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 [20:03:05] (03CR) 10Dzahn: [C: 04-2] renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 (owner: 10Dzahn) [20:04:15] (03PS2) 10Dzahn: renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 (https://phabricator.wikimedia.org/T162085) [20:05:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:05:55] 06Operations, 10netops: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055#3154987 (10faidon) 05Open>03Resolved a:03faidon I just deployed a change which puts 224/4 back to special-ranges4 and nothing seems to be broken. [20:06:22] (03CR) 10Dzahn: [C: 04-1] "don't merge yet" [puppet] - 10https://gerrit.wikimedia.org/r/346356 (https://phabricator.wikimedia.org/T162085) (owner: 10Dzahn) [20:08:13] (03CR) 10Hashar: [C: 031] "Zuul ssh to Gerrit using Paramiko 1.15.1. 
I gave it a quick try from contint1001 by running /var/lib/zuul/gerrit-stream-events.py None o" [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [20:10:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:12:20] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:15:00] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:16:10] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Anomie) >>! In T162129#3154715, @MaxSem wrote: > We used to block API requests that provided no UA - anybody remembers... [20:17:48] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155015 (10jcrespo) He is back, and now trying to parse Special pages, too :-) > Did the IPs change periodically or did they act... [20:22:48] (03PS7) 10Thcipriani: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [20:24:40] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:25] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3155030 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1056.eqiad.wmnet'] ``` The log can b... 
[20:33:10] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3155081 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1055.eqiad.wmnet'] ``` The log can b... [20:33:40] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:38:19] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3155103 (10Jgreen) 05Open>03Resolved a:03Jgreen looks good, host is imaged and up! [20:38:41] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3155107 (10Jgreen) 05Open>03Resolved looks good, host is imaged and up! [20:39:42] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155110 (10Anomie) The simple solution may be to just block the IPs in varnish or the like, perhaps delivering a message like "If... 
[20:40:20] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:47:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:48:22] !log catrope@tin Synchronized php-1.29.0-wmf.19/extensions/Echo/: T162173 (duration: 00m 43s) [20:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:28] T162173: Clicking on Notices/Alerts issues a banner over the other icon - https://phabricator.wikimedia.org/T162173 [20:50:09] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3155140 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1056.eqiad.wmnet'] ``` and were **ALL** successful. [20:52:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:52:40] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:53:16] (03PS8) 10Thcipriani: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [20:56:43] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155143 (10Tgr) > I don't think it is malign, just parallelizing queries to load balancing source IPs (always the same ones). Ye... 
[20:58:47] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3155144 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1055.eqiad.wmnet'] ``` and were **ALL** successful. [21:00:45] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:00:48] (03CR) 10Hashar: "And I have verified the CI instances that build packages are all clean. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [21:00:55] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:34] (03PS3) 10Dzahn: renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 (https://phabricator.wikimedia.org/T162085) [21:12:40] !log revoked old labvirt-star.eqiad.wmnet cert - created new csr, signed it (CA: wmf_ca_2014_2017). 
deploying new labvirt-star.eqiad valid for 720 days (T162085) [21:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:46] T162085: labvirt-star.eqiad.wmnet.crt expiring soon - https://phabricator.wikimedia.org/T162085 [21:13:08] (03PS4) 10Dzahn: renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 (https://phabricator.wikimedia.org/T162085) [21:16:16] (03CR) 10Dzahn: [C: 032] renew labvirt-star.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/346356 (https://phabricator.wikimedia.org/T162085) (owner: 10Dzahn) [21:18:32] !log running puppet across labvirt10* to replace cert [21:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:58] andrewbogott: done, icinga is all green again :) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm [21:20:08] cool [21:20:38] !log applying mariadb MDEV#7383 patch on db1034 T159319 [21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:26:33] !log mobrovac@tin Started deploy [citoid/deploy@7dbbac8]: Bump service-runner to pick up new DNS caching [21:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:27:12] 06Operations, 06Labs, 10Labs-Infrastructure: labvirt-star.eqiad.wmnet.crt expiring soon - https://phabricator.wikimedia.org/T162085#3155258 (10Dzahn) [21:27:23] 06Operations, 06Labs, 10Labs-Infrastructure: labvirt-star.eqiad.wmnet.crt expiring soon - https://phabricator.wikimedia.org/T162085#3152009 (10Dzahn) 05Open>03Resolved [21:27:56] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet 
is currently enabled, last run 28 seconds ago with 0 failures [21:29:00] 06Operations, 06Labs, 10Labs-Infrastructure: labvirt-star.eqiad.wmnet.crt expiring soon - https://phabricator.wikimedia.org/T162085#3152009 (10Dzahn) @labvirt1014:~# openssl x509 -in /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt -text -noout | grep After Not After : Mar 25 21:00:52 2019 GMT [21:29:46] !log mobrovac@tin Finished deploy [citoid/deploy@7dbbac8]: Bump service-runner to pick up new DNS caching (duration: 03m 13s) [21:29:48] (03PS1) 10Andrew Bogott: Revert "Keystonehooks: Exclude 'novaobserver' user from posix user group." [puppet] - 10https://gerrit.wikimedia.org/r/346451 [21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:56] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:24] (03PS1) 10Jdlrobson: Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) [21:31:25] (03PS1) 10Jdlrobson: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T160076) [21:31:29] (03CR) 10Jdlrobson: [C: 04-1] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [21:31:34] (03CR) 10Andrew Bogott: [C: 032] Revert "Keystonehooks: Exclude 'novaobserver' user from posix user group." 
[puppet] - 10https://gerrit.wikimedia.org/r/346451 (owner: 10Andrew Bogott) [21:33:54] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155315 (10ssastry) [21:33:59] !log mobrovac@tin Started deploy [eventstreams/deploy@cf892f4]: Bump service-runner to pick up new DNS caching [21:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:05] !log mobrovac@tin Finished deploy [eventstreams/deploy@cf892f4]: Bump service-runner to pick up new DNS caching (duration: 02m 04s) [21:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:03] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3155345 (10Dzahn) a:05RobH>03Ayokura [21:38:20] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (10Dzahn) a:05Ayokura>03ayounsi [21:40:36] !log mobrovac@tin Started deploy [mathoid/deploy@4eb6d9d]: Bump service-runner to pick up new DNS caching [21:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:07] 06Operations, 10DBA, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3155357 (10jcrespo) [21:44:04] !log mobrovac@tin Finished deploy [mathoid/deploy@4eb6d9d]: Bump service-runner to pick up new DNS caching (duration: 03m 27s) [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:03] !log mobrovac@tin Started deploy [cxserver/deploy@b4184d3]: Bump service-runner to pick up new DNS caching [21:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:36] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [21:48:40] !log mobrovac@tin Finished 
deploy [cxserver/deploy@b4184d3]: Bump service-runner to pick up new DNS caching (duration: 03m 37s) [21:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:27] !log mobrovac@tin Started deploy [mobileapps/deploy@b93488f]: Bump service-runner to pick up new DNS caching [21:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:22] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3155373 (10faidon) a:05ayounsi>03RobH [21:52:10] !log mobrovac@tin Finished deploy [mobileapps/deploy@b93488f]: Bump service-runner to pick up new DNS caching (duration: 02m 43s) [21:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:43] (03PS1) 10Niharika29: Update $wgLoginNotifyAttemptsKnownIP in Labs to make testing easier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346464 (https://phabricator.wikimedia.org/T160094) [21:52:53] 06Operations, 10ops-codfw, 10hardware-requests, 10netops, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3155377 (10Dzahn) [21:53:19] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (10Dzahn) [21:53:27] !log mobrovac@tin Started deploy [graphoid/deploy@5fc26cb]: Bump service-runner to pick up new DNS caching [21:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:06] !log mobrovac@tin Started deploy [trending-edits/deploy@5cc3969]: Bump service-runner to pick up new DNS caching [21:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:35] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3155380 (10BBlack) [21:55:42] !log mobrovac@tin Finished deploy 
[graphoid/deploy@5fc26cb]: Bump service-runner to pick up new DNS caching (duration: 02m 15s) [21:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:56] RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:00:46] !log mobrovac@tin Finished deploy [trending-edits/deploy@5cc3969]: Bump service-runner to pick up new DNS caching (duration: 06m 40s) [22:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:38] !log SCB all services updated to use the new service-runner DNS caching [22:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:15] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/6023/" [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [22:09:55] jouncebot: next [22:09:55] In 0 hour(s) and 50 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T2300) [22:09:59] jouncebot: now [22:10:00] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [22:10:06] Ok I'm stealing a slot for scap [22:19:55] (03PS1) 10Catrope: Set $wgOresThresholds now that it exists in wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 [22:20:14] (03PS2) 10Jdlrobson: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) [22:21:50] (03PS2) 10Catrope: Set $wgOresThresholds on wikis where both ORES and rcfilters are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 [22:22:34] (03PS3) 10Catrope: Set $wgOresThresholds on wikis where both ORES and rcfilters are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 [22:22:41] * p858snake watches RainbowSprinkles be put in cuffs by the DeploymentPolice™ for 
stealing [22:23:10] lol [22:24:51] before i get yelled at i need to make this ONE joke then im done... RainbowSprinkles do you want some scap with a side of sprinkles? [22:25:11] I don't get the joke [22:25:24] lol [22:25:27] p858snake: I am the deployment police. [22:25:40] We saw it. [22:25:46] PROBLEM - Hadoop DataNode on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [22:26:06] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:06] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:37] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:27:06] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:27:36] dbstore1002 is probably just temporary extra load [22:27:46] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:46] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:46] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:46] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:46] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:47] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:48] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:27:48] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:56] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:56] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:28:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:29:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:30:36] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:30:36] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:30:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:30:36] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:30:36] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:30:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [22:30:37] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:30:38] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:30:38] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:30:46] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:32:14] 06Operations, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, 07Privacy: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#3155492 (10Reedy) [22:33:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo 
is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:34:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:34:28] !log demon@tin Started scap: re-syncing old wmf.14-16 branches...cleaned up a little too much [22:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:13] 06Operations, 10ops-eqiad, 10netops: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199#3155501 (10Reedy) [22:38:36] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:36] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:36] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:47] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:48] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:48] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:38:56] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:56] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:56] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:57] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:57] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:57] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:06] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:16] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:36] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [22:44:28] I am going to ack all of dbstore1002 so it doesn't keep spamming [22:44:37] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [22:44:37] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:44:37] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:44:37] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:37] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:37] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:44:38] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:38] RECOVERY - MariaDB Slave 
SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:44:46] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:46] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:46] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:46] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [22:44:46] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:44:47] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [22:45:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:46:16] PROBLEM - Keystone admin and observer projects exist on labtestnet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:56] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:48:06] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:48:06] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:48:27] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:48:27] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:48:27] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [22:48:59] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3155545 (10Dzahn) I signed Arzhel's GPG key after he read the fingerprint to me over Hangout. 
gpg --fingerprint 58E24182 Key fingerprint = 8F89 0CBB E7BE... [22:50:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:54:36] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:56:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T2300). Please do the needful. [23:00:04] Niharika, Jdlrobson, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] o/ [23:00:21] I'll do the SWAT today [23:00:40] Niharika's is labs only so that one can go first [23:00:44] (03CR) 10Catrope: [C: 032] Update $wgLoginNotifyAttemptsKnownIP in Labs to make testing easier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346464 (https://phabricator.wikimedia.org/T160094) (owner: 10Niharika29) [23:01:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:01:42] jdlrobson: You around for your SWAT? 
[23:01:50] yup [23:01:53] (03CR) 10Catrope: [C: 032] Set $wgOresThresholds on wikis where both ORES and rcfilters are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 (owner: 10Catrope) [23:01:59] Cool [23:02:05] (03CR) 10Catrope: [C: 032] Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [23:02:15] jouncebot: now [23:02:15] For the next 0 hour(s) and 57 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170404T2300) [23:02:46] * RoanKattouw waits for Jenkins [23:03:04] Wonder if https://gerrit.wikimedia.org/r/#/c/346274/ should just go out... [23:03:10] (03Merged) 10jenkins-bot: Update $wgLoginNotifyAttemptsKnownIP in Labs to make testing easier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346464 (https://phabricator.wikimedia.org/T160094) (owner: 10Niharika29) [23:03:18] Reverted in master, presumably in time for .19... But is broken in .18 [23:03:24] (03CR) 10jenkins-bot: Update $wgLoginNotifyAttemptsKnownIP in Labs to make testing easier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346464 (https://phabricator.wikimedia.org/T160094) (owner: 10Niharika29) [23:03:42] (03Merged) 10jenkins-bot: Set $wgOresThresholds on wikis where both ORES and rcfilters are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 (owner: 10Catrope) [23:03:52] (03CR) 10jenkins-bot: Set $wgOresThresholds on wikis where both ORES and rcfilters are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346470 (owner: 10Catrope) [23:04:14] Reedy: Ha, nice one. I'll pick that one up, could you add it to the wiki page for the record? 
[23:04:19] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155592 (10Dzahn) I'm not sure what happened here, but yes, 0.7.0 has been uploaded and also reprepro itself thinks so: ``` [bromine:/srv/org] $ sudo -E reprepro ls parsoid parsoid | 0.7.0all | jes... [23:04:26] Thanks RoanKattouw. [23:04:52] I'm still scapping [23:05:19] Nearly done tho [23:05:28] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155594 (10Dzahn) /srv/org/wikimedia/reprepro/incoming/ has: ``` 16M -rw-r--r-- 1 reprepro reprepro 16M Nov 14 18:09 parsoid_0.6.0all_all.deb 4.0K -rw-r--r-- 1 reprepro reprepro 1.9K Nov 14 18:0... [23:05:46] RainbowSprinkles: OK, will wait [23:05:57] Niharika: You can follow the automated scap to beta labs in real time here: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/149440/console [23:06:06] Done [23:06:13] RoanKattouw: Feel free to start doing gerrit merges, staging stuff on tin [23:06:19] It's just the final apache pull I'm in now [23:06:20] Cool. [23:06:42] And it's done, your patch should be in labs now [23:06:54] \m/ [23:07:11] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155611 (10ssastry) >>! In T162200#3155592, @Dzahn wrote: > I'm not sure what happened here, but yes, 0.7.0 has been uploaded and also reprepro itself thinks so: > > > ``` > [bromine:/srv/org] $ su... [23:08:58] RainbowSprinkles: Yeah doing that already, pulling to mwdebug1002 too but that hung for some reason [23:09:55] Ugh, it's doing cdb-rebuild --no-progress which is taking forever [23:09:59] And it's also not telling me that that's what it's doing [23:10:08] * RoanKattouw files task [23:10:46] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:12:59] I probably already have a lock on it? 
[23:13:08] We're doing cdb rebuilds shortly as part of my scap [23:14:02] On mwdebug1002? [23:14:15] No it was just a 3-minute rebuild with no reporting whatsoever [23:14:26] It was using 90% CPU, didn't look like a lock [23:14:33] Filing a task about that [23:14:43] jdlrobson: Your change is live on mwdebug1002 now, please test. Sorry for the delay [23:15:01] Doing full pulls on mwdebug is always kind of funny when we end up only doing a sync-file or sync-dir afterwards for full deployment ;-) [23:15:07] * RainbowSprinkles chuckles about mwdebug in swat [23:15:45] We could abstract that into a param for scap so you don't have to ssh to that host and do funny pulls [23:15:48] RoanKattouw: testing [23:16:07] Yeah, I mean it's usually fast enough [23:16:26] And I don't even terribly mind it taking 3 minutes, as long as there's some kind of progress indication [23:16:34] (filed T162207) [23:16:34] T162207: When "scap pull" does a (slow) CDB rebuild, it should tell me that that's what it's doing - https://phabricator.wikimedia.org/T162207 [23:16:38] Actually. [23:16:41] looks good RoanKattouw [23:16:52] RoanKattouw: tbf, it does use --no-progress ;-) [23:16:56] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:18:46] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:19:01] !log demon@tin Finished scap: re-syncing old wmf.14-16 branches...cleaned up a little too much (duration: 44m 32s) [23:19:06] Sure, and I understand that it would want to suppress progress reporting from rebuild-cdb itself [23:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:34] But it should at least tell me something like 23:07:12 started cdb rebuild 23:10:45 finished cdb rebuild [23:21:16] jdlrobson: Oops I didn't actually sync your patch :( [23:21:18] Trying again [23:21:26] I was wondering why mine wasn't working.. [23:21:57] jdlrobson: OK now it's there for reals [23:22:56] (03PS2) 10Catrope: Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [23:23:03] (03CR) 10Catrope: [C: 032] Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [23:23:07] woop [23:23:15] jdlrobson: Urgh, yours didn't even merge, so it's doubly not there [23:23:25] Sorry about that, I was juggling three patches and lost track [23:24:14] (03Merged) 10jenkins-bot: Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [23:24:27] (03CR) 10jenkins-bot: Prepare for related pages configuration change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346452 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [23:27:41] Alright, my patch works [23:27:44] RoanKattouw: done for reals now? 
[23:27:51] jdlrobson: Your patch is now on mwdebug1002 for real for real [23:27:58] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3155755 (10jcrespo) [23:28:16] I guess it might be intended to be a no-op anyway? [23:28:59] RoanKattouw: yup [23:29:03] RoanKattouw: Oh, that rogue wmf.8 is gone now btw, and shouldn't happen again [23:29:09] Weird growing pains with `scap clean` [23:29:13] !log unscheduled restart of dbstore1002 T162212 [23:29:17] Thanks [23:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:21] T162212: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212 [23:29:41] jdlrobson: OK lemme know when you're done checking and I'll deploy both of our patches [23:29:45] yup checked again [23:29:50] Sweet, going live [23:29:50] looks good [23:31:08] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Prepare for related pages config change (T160076) and set $wgOresFiltersThresholds on plwiki and ptwiki (duration: 00m 41s) [23:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:14] T160076: Disable related pages on desktop beta mode - https://phabricator.wikimedia.org/T160076 [23:34:36] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [23:35:16] thanks RoanKattouw [23:35:36] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3155792 (10aaron) The timeout could be conditioned on ``` defined( 'MEDIAWIKI_JOB_RUNNER' ) ``` ...via a te... 
[23:38:46] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [23:45:25] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3155813 (10jcrespo) Probably excessive memory pressure due to heavy mysql usage... blah blah blah... restarted cleanly ... updated kernel... check new import script... check long running queries,... mysql error log is cle... [23:46:46] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:50:22] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3155832 (10jcrespo) > @leila, we can dump and copy to analytics-store, as long as there aren't any database.table name collisions. I hope you are aware that if for any reason... [23:50:55] !log reedy@tin Synchronized php-1.29.0-wmf.18/extensions/Quiz: (no justification provided) (duration: 00m 42s) [23:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:03] !log that was Revert "Start implementing Quiz generation using TemplateParser" [23:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:42] Reedy: I'll edit SAL and consolidate the two log messages if you want [23:55:54] !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration: (no justification provided) (duration: 00m 39s) [23:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:20] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3155839 (10jcrespo) There is also more load than usual since the 29, that could have contributed to it: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1002&from=... [23:58:12] Reedy: fixed your log mistake, you're welcome