[00:00:32] RoanKattouw, is it supposed to do this? [00:00:34] https://se.wikimedia.org/wiki/Diskussion:Kontor/F%C3%B6rslag/LQT_Archive_1 [00:00:42] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/245588/ and https://gerrit.wikimedia.org/r/#/c/245589/ (duration: 01m 13s) [00:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:01:33] Krenair: Whoa wat [00:01:38] Oh, crap [00:01:41] There's a script I need to run [00:01:47] Krenair: Can you revert that change for now? [00:02:03] Sorry :S [00:02:33] not easily since you did them as two separate changes [00:02:36] but ok [00:03:08] have reset on tin, will make a proper revert commit [00:03:48] Thanks [00:04:03] it's basically sync'd, fwiw [00:04:07] just that last host being slow [00:04:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: rv (duration: 01m 13s) [00:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:07:17] (03PS1) 10Alex Monk: Revert last two sewikimedia Flow commits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245608 [00:07:55] (03CR) 10Alex Monk: [C: 032] Revert last two sewikimedia Flow commits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245608 (owner: 10Alex Monk) [00:08:01] (03Merged) 10jenkins-bot: Revert last two sewikimedia Flow commits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245608 (owner: 10Alex Monk) [00:09:25] (03CR) 10Alex Monk: "I should probably make it clear that one part of the second commit was not reverted (the officewiki bit)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/245608 (owner: 10Alex Monk) [00:10:07] (03CR) 10Alex Monk: "Reverted in I890213ff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245588 (https://phabricator.wikimedia.org/T106302) (owner: 10Catrope) [00:10:12] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/245608/ (duration: 01m 13s) [00:10:14] RECOVERY - puppet last run on ganeti2005 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [00:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:11:02] (03CR) 10Alex Monk: "Half reverted in I890213ff - also, it was officewiki not mediawikiwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245589 (owner: 10Catrope) [00:12:09] (03PS2) 10Alex Monk: Enable Flow beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244724 (https://phabricator.wikimedia.org/T115100) (owner: 10Catrope) [00:12:14] RoanKattouw, ^ that's ready to go, right? 
[00:17:43] Yup [00:19:25] (03CR) 10Alex Monk: [C: 032] Enable Flow beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244724 (https://phabricator.wikimedia.org/T115100) (owner: 10Catrope) [00:19:31] (03Merged) 10jenkins-bot: Enable Flow beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244724 (https://phabricator.wikimedia.org/T115100) (owner: 10Catrope) [00:21:06] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/244724/ (duration: 01m 13s) [00:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:24:24] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [00:24:35] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [00:25:57] Yay the Flow beta feature works on zhwiki [00:27:21] Someone needs to make origni an actual remote [00:27:59] EHAD would be great too [00:29:32] !log krenair@tin Synchronized php-1.27.0-wmf.2/extensions/Flow/modules/flow-initialize.js: https://gerrit.wikimedia.org/r/#/c/245596/ (duration: 01m 13s) [00:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:30:04] that seems to have fixed it [00:37:07] 6operations, 6Labs, 10Labs-Infrastructure, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. 
- https://phabricator.wikimedia.org/T98009#1720401 (10Krenair) [00:47:37] (03PS1) 10Catrope: Revert "Revert last two sewikimedia Flow commits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245609 [00:47:45] (03CR) 10Catrope: [C: 032] Revert "Revert last two sewikimedia Flow commits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245609 (owner: 10Catrope) [00:47:51] (03Merged) 10jenkins-bot: Revert "Revert last two sewikimedia Flow commits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245609 (owner: 10Catrope) [00:51:33] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:51:38] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Reapply: Flow-occupy talk namespaces on sewikimedia (duration: 01m 13s) [00:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:38:34] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [01:40:11] 7Puppet, 6Labs, 7Documentation: Missing documentation for labs puppet roles - https://phabricator.wikimedia.org/T91770#1720501 (10Krenair) [01:40:13] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [01:46:06] (03PS1) 10Faidon Liambotis: Remove nfs::netapp::home(::othersite), unused [puppet] - 10https://gerrit.wikimedia.org/r/245613 [01:46:08] (03PS1) 10Faidon Liambotis: Remove nfs::common, replaced by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/245614 [01:46:10] (03PS1) 10Faidon Liambotis: Inline class nfs::data to snapshot::common [puppet] - 10https://gerrit.wikimedia.org/r/245615 [01:46:12] (03PS1) 10Faidon Liambotis: Remove classes snasphot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [01:47:32] (03PS2) 10Faidon Liambotis: Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [01:47:47] (03PS1) 10Faidon Liambotis: 
Update BounceHandler's "Internal IPs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245617 [01:49:43] (03CR) 10Faidon Liambotis: [C: 04-1] "This doesn't seem to work: the curl to meta returns a 301 to "http://meta.wikimedia.org/w/api.php" when run with the exact same parameters" [puppet] - 10https://gerrit.wikimedia.org/r/245128 (https://phabricator.wikimedia.org/T114984) (owner: 1001tonythomas) [01:51:59] (03CR) 10Faidon Liambotis: [C: 032] Update BounceHandler's "Internal IPs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245617 (owner: 10Faidon Liambotis) [01:52:06] (03Merged) 10jenkins-bot: Update BounceHandler's "Internal IPs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245617 (owner: 10Faidon Liambotis) [01:53:37] !log faidon@tin Synchronized wmf-config/CommonSettings.php: unbreak BounceHandler (duration: 01m 14s) [01:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:56:31] (03PS2) 10Faidon Liambotis: Make mx1001/mx2001 to HTTP POST to meta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/245128 (https://phabricator.wikimedia.org/T114984) (owner: 1001tonythomas) [01:56:55] (03CR) 10Faidon Liambotis: [C: 032 V: 032] "Ignore me, I was being silly :)" [puppet] - 10https://gerrit.wikimedia.org/r/245128 (https://phabricator.wikimedia.org/T114984) (owner: 1001tonythomas) [02:00:40] (03CR) 10Faidon Liambotis: [C: 04-1] wikitech: add SSL cert expiry monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [02:06:36] 6operations, 10MobileFrontend, 10Traffic, 7Mobile, 10reading-web-sprint-58-6: ml.wikipedia.org not redirecting to mobile site while accessing from a mobile device; many "Error: Module not found" errors - https://phabricator.wikimedia.org/T115191#1720539 (10Krenair) I was browsing through the wikimedia va... [02:14:27] (03CR) 10Faidon Liambotis: [C: 04-1] "Pretty solid work. 
See inline for a couple of comments. Other than those it's a +2, feel free to merge without me after fixing :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/245504 (https://phabricator.wikimedia.org/T104738) (owner: 10Ori.livneh) [02:14:42] (03CR) 10Faidon Liambotis: [C: 032] Add grafana-test.wikimedia.org, behind misc-web-lb [dns] - 10https://gerrit.wikimedia.org/r/245503 (owner: 10Ori.livneh) [02:16:36] (03CR) 10Faidon Liambotis: [C: 04-1] "Refers to "grafana-testing.wm.org" while the rest of the patchset is for "grafana-test.wm.org". Will fix." [puppet] - 10https://gerrit.wikimedia.org/r/245494 (owner: 10Ori.livneh) [02:17:32] (03PS4) 10Faidon Liambotis: misc varnish: proxy grafana-test.wm.o to krypton as well [puppet] - 10https://gerrit.wikimedia.org/r/245494 (owner: 10Ori.livneh) [02:18:28] (03PS5) 10Faidon Liambotis: misc varnish: proxy grafana-test.wm.o to krypton as well [puppet] - 10https://gerrit.wikimedia.org/r/245494 (owner: 10Ori.livneh) [02:18:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] misc varnish: proxy grafana-test.wm.o to krypton as well [puppet] - 10https://gerrit.wikimedia.org/r/245494 (owner: 10Ori.livneh) [02:19:21] (03PS6) 10Faidon Liambotis: reprepro: import from grafana apt [puppet] - 10https://gerrit.wikimedia.org/r/245490 (owner: 10Ori.livneh) [02:22:43] (03CR) 10Faidon Liambotis: [C: 032] "I had the look at the package and I can't say I'm particularly thrilled. 
The repository isn't great either, as for example does not provid" [puppet] - 10https://gerrit.wikimedia.org/r/245490 (owner: 10Ori.livneh) [02:28:49] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 07m 01s) [02:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:17] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-13 02:32:17+00:00 [02:32:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 13 02:32:17 UTC 2015 (duration 32m 16s) [02:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:43] (03PS1) 10Faidon Liambotis: reprepro: actually add source grafana to jessie [puppet] - 10https://gerrit.wikimedia.org/r/245618 [02:32:45] (03PS1) 10Faidon Liambotis: Move apt-transport-https to install_server [puppet] - 10https://gerrit.wikimedia.org/r/245619 [02:38:08] (03PS1) 10Faidon Liambotis: reprepro: fix grafana's VerifyRelease key [puppet] - 10https://gerrit.wikimedia.org/r/245620 [02:38:31] (03CR) 10Faidon Liambotis: [C: 032] reprepro: actually add source grafana to jessie [puppet] - 10https://gerrit.wikimedia.org/r/245618 (owner: 10Faidon Liambotis) [02:38:55] (03CR) 10Faidon Liambotis: [C: 032] Move apt-transport-https to install_server [puppet] - 10https://gerrit.wikimedia.org/r/245619 (owner: 10Faidon Liambotis) [02:39:13] (03CR) 10Faidon Liambotis: [C: 032] reprepro: fix grafana's VerifyRelease key [puppet] - 10https://gerrit.wikimedia.org/r/245620 (owner: 10Faidon Liambotis) [02:41:00] (03CR) 10Faidon Liambotis: "See:" [puppet] - 10https://gerrit.wikimedia.org/r/245490 (owner: 10Ori.livneh) [02:43:52] (03PS3) 10Faidon Liambotis: Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [02:43:54] (03PS2) 10Faidon Liambotis: Remove 
nfs::netapp::home(::othersite), unused [puppet] - 10https://gerrit.wikimedia.org/r/245613 [02:43:56] (03PS2) 10Faidon Liambotis: Remove nfs::common, replaced by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/245614 [02:43:58] (03PS2) 10Faidon Liambotis: Inline class nfs::data to snapshot::common [puppet] - 10https://gerrit.wikimedia.org/r/245615 [02:48:26] (03CR) 10Faidon Liambotis: [C: 032] Remove nfs::netapp::home(::othersite), unused [puppet] - 10https://gerrit.wikimedia.org/r/245613 (owner: 10Faidon Liambotis) [02:48:49] (03CR) 10Faidon Liambotis: [C: 032] Remove nfs::common, replaced by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/245614 (owner: 10Faidon Liambotis) [02:49:50] (03CR) 10Faidon Liambotis: [C: 032] Inline class nfs::data to snapshot::common [puppet] - 10https://gerrit.wikimedia.org/r/245615 (owner: 10Faidon Liambotis) [03:35:51] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1720607 (10chasemp) I rewrote the dump... [03:56:17] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1720620 (10mmodell) The underlying iss... 
[04:20:14] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: puppet fail [04:35:55] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: Puppet has 1 failures [04:47:03] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:04:24] RECOVERY - puppet last run on mw2122 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:16:14] 6operations, 10MediaWiki-extensions-BounceHandler, 5Patch-For-Review: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1720653 (1001tonythomas) @Jgreen can we see if things are working sometime tonight ( in ~10 hours ) ? We will have to pro... [05:16:34] PROBLEM - nutcracker process on mw1155 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:16:54] PROBLEM - dhclient process on mw1155 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:14] PROBLEM - HHVM processes on mw1155 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:23] PROBLEM - salt-minion processes on mw1155 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:30:23] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:24] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:34] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:24] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:53] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:04] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:56:04] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:56:05] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:43] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:57:05] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:15] RECOVERY - puppet last run on cp1053 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:53] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:35:44] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [07:40:44] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [09:24:15] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [09:51:13] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:04:08] (03PS1) 10Hoo man: Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) [10:07:45] (03PS2) 10Hoo man: Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) [10:36:06] (03PS3) 10Physikerwelt: vagarnt::mediawiki: Ensure clone before adding config [puppet] - 10https://gerrit.wikimedia.org/r/245207 (https://phabricator.wikimedia.org/T115229) (owner: 10BryanDavis) [10:37:49] (03CR) 10Physikerwelt: [C: 031] "The manual excution of the git clone command solved the problem." [puppet] - 10https://gerrit.wikimedia.org/r/245207 (https://phabricator.wikimedia.org/T115229) (owner: 10BryanDavis) [10:45:52] (03CR) 10Zhuyifei1999: [C: 031] "Correct logic." 
[puppet] - 10https://gerrit.wikimedia.org/r/245207 (https://phabricator.wikimedia.org/T115229) (owner: 10BryanDavis) [10:47:13] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [11:32:27] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1721062 (10Aklapper) [11:32:40] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1470600 (10Aklapper) [11:40:34] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [11:42:13] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [11:46:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [11:52:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [11:53:10] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1721105 (10Base) @jcrespo could you kindly point to where the correct usage of the API you are talking about is documented. I was trying to f... 
[11:54:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:07:23] (03PS1) 10KartikMistry: Enable suggestions in de, fa, fi, he, nn, pa, pl and te wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245862 [12:09:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:47] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1721179 (10jcrespo) @mmodell The graph... [12:17:01] (03PS2) 10KartikMistry: Enable suggestions in de, fa, fi, he, nn, pa, pl and te wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245862 (https://phabricator.wikimedia.org/T112848) [12:18:20] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1721183 (10Krenair) >>! In T109279#172... 
[12:30:44] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [12:54:16] (03PS3) 10Krinkle: Enable CX suggestions for de, fa, fi, he, nn, pa, pl and te wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245862 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [12:59:14] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:14] !log repooling cp2017 (codfw upload) [13:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:37] 6operations: cp2017 is down - https://phabricator.wikimedia.org/T114022#1721347 (10BBlack) 5Open>3Resolved No crashes or strange syslog/dmesg output since, repooling and closing, will re-open if it recurs. [13:29:31] (03PS1) 10Muehlenhoff: zim: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/245874 [13:39:04] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:25] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 4.683 second response time [13:44:09] (03PS1) 10Muehlenhoff: Restrict access to the deployment redis db to the internal network (plus silver) [puppet] - 10https://gerrit.wikimedia.org/r/245876 [13:47:53] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:26] (03Abandoned) 10Hashar: debian: fix lintian error about bad dist name [software/conftool] - 10https://gerrit.wikimedia.org/r/226910 (owner: 10Hashar) [13:50:32] (03CR) 10Ori.livneh: Provision Grafana 2 on grafana-test.wikimedia.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/245504 (https://phabricator.wikimedia.org/T104738) (owner: 10Ori.livneh) [13:51:14] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 
3.084 second response time [13:52:12] (03PS3) 10Ori.livneh: Provision Grafana 2 on grafana-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/245504 (https://phabricator.wikimedia.org/T104738) [13:55:09] (03PS4) 10Ori.livneh: Provision Grafana 2 on grafana-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/245504 (https://phabricator.wikimedia.org/T104738) [13:55:19] (03CR) 10Ori.livneh: [C: 032 V: 032] "Faidon, thanks very much for the review / debugging / merges! <3" [puppet] - 10https://gerrit.wikimedia.org/r/245504 (https://phabricator.wikimedia.org/T104738) (owner: 10Ori.livneh) [13:56:21] 6operations, 10MediaWiki-extensions-BounceHandler, 5Patch-For-Review: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1721438 (10Jgreen) >>! In T114984#1720653, @01tonythomas wrote: > @Jgreen can we see if things are working sometime tonig... [13:56:39] !log on etherpad1001: restarting etherpad-lite [13:56:44] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:24] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.007 second response time [14:00:12] (03PS1) 10Ori.livneh: Fix-up for service stanza in Idd37460a [puppet] - 10https://gerrit.wikimedia.org/r/245877 [14:00:33] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for service stanza in Idd37460a [puppet] - 10https://gerrit.wikimedia.org/r/245877 (owner: 10Ori.livneh) [14:01:05] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures [14:04:43] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:05:54] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection refused [14:07:03] PROBLEM - HHVM rendering on mw1155 is CRITICAL: Connection refused [14:11:17] !log restarted gitblit on antimony [14:11:20] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:16] !log note to self: we should migrate all our service to Java [14:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55047 bytes in 0.154 second response time [14:14:23] @bblack: Danke :0 [14:24:34] PROBLEM - Check size of conntrack table on mw1155 is CRITICAL: Connection refused by host [14:24:35] PROBLEM - DPKG on mw1155 is CRITICAL: Connection refused by host [14:24:43] PROBLEM - nutcracker port on mw1155 is CRITICAL: Connection refused by host [14:25:04] PROBLEM - RAID on mw1155 is CRITICAL: Connection refused by host [14:25:24] PROBLEM - Disk space on mw1155 is CRITICAL: Connection refused by host [14:25:44] PROBLEM - configured eth on mw1155 is CRITICAL: Connection refused by host [14:29:20] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1721480 (10ArielGlenn) a:3Cmjohnson [14:30:25] RECOVERY - RAID on mw1155 is OK: OK: no RAID installed [14:30:44] RECOVERY - Disk space on mw1155 is OK: DISK OK [14:30:45] RECOVERY - salt-minion processes on mw1155 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:30:45] RECOVERY - dhclient process on mw1155 is OK: PROCS OK: 0 processes with command name dhclient [14:31:04] RECOVERY - HHVM processes on mw1155 is OK: PROCS OK: 11 processes with command name hhvm [14:31:05] RECOVERY - configured eth on mw1155 is OK: OK - interfaces up [14:31:46] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.904 second response time [14:32:17] RECOVERY - nutcracker port on mw1155 is OK: TCP OK - 0.000 second response time on port 11212 [14:32:27] RECOVERY - nutcracker process on mw1155 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [14:33:37] RECOVERY - 
Check size of conntrack table on mw1155 is OK: OK: nf_conntrack is 1 % full
[14:33:38] (PS1) Ori.livneh: Add ini() to stdlib [puppet] - https://gerrit.wikimedia.org/r/245883
[14:33:44] Reedy, bblack: git.wikimedia.org isn't working for me again
[14:33:56] RECOVERY - DPKG on mw1155 is OK: All packages OK
[14:34:05] PuppyKun: It does beg the question of why you are using it
[14:34:16] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 66611 bytes in 8.970 second response time
[14:34:16] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[14:34:48] Reedy: What should I be using? I don't like extension distributor. I've been git cloning the 'repo summary' links for extensions I want
[14:35:09] Depends what you're actually doing
[14:35:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:35:22] ^ lmao
[14:35:54] TIL: Users are quicker than icinga
[14:35:54] git clone https://git.wikimedia.org/path/to/extension myserver/path/to/extensions
[14:36:19] (CR) Ori.livneh: [C: 2] Add ini() to stdlib [puppet] - https://gerrit.wikimedia.org/r/245883 (owner: Ori.livneh)
[14:36:35] LOL
[14:36:45] wikimedia bot was kicked from server for excess flood
[14:36:58] Clone from gerrit, or github or even diffusion
[14:37:05] I generally wouldn't recommend using git.wm.o for anything
[14:37:40] perhaps we should just decommission git.wm.o?
[14:37:48] seems a lot better than saying "don't pick this 1 of the 4 options"
[14:38:02] I'm not quite sure why we haven't exactly
[14:38:12] I guess we will be if/when we move to phab CR etc
[14:38:17] pending phabricator
[14:38:34] also mediawiki.org extension page links to git
[14:38:40] except for code review, which links to gerrit
[14:38:56] but idk what the gerrit download link would be? unless it's literally the same except gerrit.wm.o instead of git.wm.o
[14:39:25] PuppyKun: https://gerrit.wikimedia.org/r/p/mediawiki/core.git
[14:41:00] operations, Phabricator, Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1721609 (RobH)
[14:41:02] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1721608 (RobH)
[14:41:20] (PS1) Ori.livneh: grafana 2: disable auth.basic [puppet] - https://gerrit.wikimedia.org/r/245885
[14:41:38] (CR) Ori.livneh: [C: 2 V: 2] grafana 2: disable auth.basic [puppet] - https://gerrit.wikimedia.org/r/245885 (owner: Ori.livneh)
[14:41:59] yeah personally I just use github as our repo browser
[14:43:13] ori: I wonder if we should move modules/librenms/lib/puppet/parser/functions/phpdump.rb to wmflib, btw
[14:43:29] yes, we should
[14:43:55] we should also factor the common logic out of phpdump, php_ini and ini into shared methods
[14:44:06] and possibly make php_ini() just be ini() with some parameter
[14:45:13] bblack: I do too :)
[14:45:21] JohnFLewis: hi :3
[14:45:28] hi
[14:46:09] (PS1) Milimetric: [WIP] Add a public endpoint for AQS [puppet] - https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830)
[14:46:14] hashar, others: what about extensions? Or could someone explain how to browse it on Gerrit? The search just shows a bunch of the changes made to it etc
[14:46:42] (PS5) Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - https://gerrit.wikimedia.org/r/244350
[14:46:45] (PS2) Andrew Bogott: Openstack: Don't notify keystone when the keystone policy changes [puppet] - https://gerrit.wikimedia.org/r/244349
[14:46:45] https://gerrit.wikimedia.org/r/#/admin/projects/
[14:46:46] (PS1) Andrew Bogott: Designate: Explicitly turn off designate-pool-manager and designate-mdns on the inactive spare [puppet] - https://gerrit.wikimedia.org/r/245888
[14:47:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55020 bytes in 0.135 second response time
[14:47:46] (PS2) Andrew Bogott: Designate: Explicitly turn off designate-pool-manager and designate-mdns on the inactive spare [puppet] - https://gerrit.wikimedia.org/r/245888
[14:48:28] (CR) Andrew Bogott: [C: 2] Designate: Explicitly turn off designate-pool-manager and designate-mdns on the inactive spare [puppet] - https://gerrit.wikimedia.org/r/245888 (owner: Andrew Bogott)
[14:48:46] operations, hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1721665 (RobH) We may want to allocate: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 64GB Memory, Dual 300GB SSD, H310 Mini Raid Card wmf3542 Since the H310 is useless for high dis...
[14:49:11] operations, MobileFrontend, Traffic, Mobile, reading-web-sprint-58-6: ml.wikipedia.org not redirecting to mobile site while accessing from a mobile device; many "Error: Module not found" errors - https://phabricator.wikimedia.org/T115191#1721666 (BBlack) Yeah that must be it, will make a patch.
[14:50:32] operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#1721674 (Dzahn) a:Dzahn
[14:50:45] operations, Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1721676 (Dzahn) a:akosiaris
[14:51:09] operations, hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1721677 (RobH) p:Triage>Normal
[14:52:00] operations, Monitoring: Fix torrus to not destroy stats when varnish is restarted - https://phabricator.wikimedia.org/T79127#1721685 (Dzahn) Open>declined a:Dzahn declining per T87840
[14:52:23] (PS1) Dzahn: torrus: remove role from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/245890 (https://phabricator.wikimedia.org/T87840)
[14:53:09] (PS1) BBlack: Fix mobile direct for m.* langs - T115191 [puppet] - https://gerrit.wikimedia.org/r/245891
[14:53:22] (PS1) Ori.livneh: grafana2: rewrite REMOTE_USER envvar into X-WEBAUTH-USER header [puppet] - https://gerrit.wikimedia.org/r/245892
[14:53:31] (PS2) Ori.livneh: grafana2: rewrite REMOTE_USER envvar into X-WEBAUTH-USER header [puppet] - https://gerrit.wikimedia.org/r/245892
[14:54:04] (PS4) Yuvipanda: vagarnt::mediawiki: Ensure clone before adding config [puppet] - https://gerrit.wikimedia.org/r/245207 (https://phabricator.wikimedia.org/T115229) (owner: BryanDavis)
[14:54:07] (CR) BBlack: "I'm not 100% sure if \b (which is a form of positive assertion) is valid inside of (?!) (which is a negative lookahead assertion). Need t" [puppet] - https://gerrit.wikimedia.org/r/245891 (owner: BBlack)
[14:54:18] (CR) Yuvipanda: [C: 2 V: 2] vagarnt::mediawiki: Ensure clone before adding config [puppet] - https://gerrit.wikimedia.org/r/245207 (https://phabricator.wikimedia.org/T115229) (owner: BryanDavis)
[14:54:20] (CR) Ori.livneh: [C: 2] grafana2: rewrite REMOTE_USER envvar into X-WEBAUTH-USER header [puppet] - https://gerrit.wikimedia.org/r/245892 (owner: Ori.livneh)
[14:54:37] operations: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (Dzahn) NEW
[14:56:56] ori: did you puppet merge mine too?
[14:57:16] yuvipanda: not too, but instead
[14:57:21] mine didn't merge because it needed another rebase
[14:57:25] just realized that
[14:57:42] ah
[14:57:44] heh
[14:57:46] ok
[14:57:50] sorry about that.
[14:59:15] np, my bad
[14:59:23] (PS3) Ori.livneh: grafana2: rewrite REMOTE_USER envvar into X-WEBAUTH-USER header [puppet] - https://gerrit.wikimedia.org/r/245892
[14:59:36] (CR) Ori.livneh: [C: 2 V: 2] grafana2: rewrite REMOTE_USER envvar into X-WEBAUTH-USER header [puppet] - https://gerrit.wikimedia.org/r/245892 (owner: Ori.livneh)
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151013T1500). Please do the needful.
[15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:09] (CR) Mobrovac: [C: -1] "Giving -1 on the basis of:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) (owner: Milimetric)
[15:02:18] I'm here.
[15:02:23] who is SWAT'ng?
[15:03:18] kart_: I can SWAT [15:04:14] cool [15:05:59] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1721767 (10Bawolff) I would say we should redirect to correct url instead of 400. >>! In T106517#1721105, @Base wrote: > @jcrespo could yo... [15:07:02] whoa there, big errors in fatalmonitor this morning: Undefined variable: lang in /srv/mediawiki/wmf-config/InitialiseSettings.php on line 1209 [15:07:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:23] looks like it's caused by this: https://gerrit.wikimedia.org/r/#/c/243920/ [15:08:33] Glaisher: Krenair ^ [15:09:05] thcipriani: Just revert it [15:09:19] Reedy: kk [15:12:55] Hi. Comment added to https://phabricator.wikimedia.org/T111335#1721775 [15:14:12] (03PS1) 10Thcipriani: Revert "Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245897 [15:15:06] thcipriani: want me to +2? [15:15:15] Reedy, please [15:15:24] (03CR) 10Reedy: [C: 032] Revert "Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245897 (owner: 10Thcipriani) [15:15:30] (03Merged) 10jenkins-bot: Revert "Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245897 (owner: 10Thcipriani) [15:15:52] !log grafana-test: Imported Grafana dashboards from ElasticSearch [15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:02] thcipriani: Looking at the other usages of $lang in IS, they're all in single quotes. Which I guess is the problem [15:16:09] Will comment on the original revision [15:16:11] Reedy: yeah, seems that way. 
[15:16:43] (03CR) 10Reedy: "Needs to be remade with '$lang' not "$lang" based on other usages in IS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [15:18:13] 6operations, 7Graphite, 5Patch-For-Review: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1721789 (10ori) a:3ori Grafana 2 is now running on https://grafana-test.wikimedia.org/. I imported all existing Grafana dashboards from our Grafana 1.x installation. The next step is to test it... [15:18:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default" (duration: 01m 13s) [15:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:39] fatalmonitor seems to be going down. Reedy Dereckson thanks for your help! [15:18:48] np [15:19:23] kart_: now you're up :) [15:21:05] :) [15:22:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245862 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:22:09] (03Merged) 10jenkins-bot: Enable CX suggestions for de, fa, fi, he, nn, pa, pl and te wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245862 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:25:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable CX suggestions for de, fa, fi, he, nn, pa, pl and te wikipedias [[gerrit:245862]] (duration: 01m 13s) [15:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:05] ^ kart_ check please [15:26:18] Sure [15:31:22] thcipriani: thanks! looks fine! [15:31:31] kart_: awesome. Thanks for checking. 
[15:36:08] (03PS1) 10Ori.livneh: graphite: update CORS regex for grafana-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/245915 [15:36:42] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite: update CORS regex for grafana-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/245915 (owner: 10Ori.livneh) [15:43:38] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures [15:44:58] Krenair: Was it you that moved the dblists into their own sub-directory? It broke Beta Cluster – https://phabricator.wikimedia.org/T115302 [15:46:58] James_F, Reedy and ori did it [15:47:18] Krenair: Gosh, clean-up and good things done by people other than you?! ;-) [15:47:22] Also, meh. [15:47:31] ah, i'll fix [15:47:46] Probably just needs a six-character change in scap. [15:49:18] James_F: all changes to scap are composed of four-letter groups [15:49:20] (03PS1) 10Ori.livneh: update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 [15:49:27] * James_F grins. [15:49:27] bd808: ^ [15:50:03] bd808: Like "darn" and "gosh"? [15:50:06] ori: is that the only place we grab a dblist directly? [15:50:10] (03CR) 10Alex Monk: [C: 032] update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 (owner: 10Ori.livneh) [15:50:14] James_F: yeah and fsck [15:50:33] as far as i could see [15:51:12] bd808, only remaining one that I could find... [15:51:40] excellent. who wants to deploy it? [15:51:51] I haven't deployed a scap change yet [15:51:57] Since I +2'd it I should probably figure out how that's done [15:52:16] Krenair: sounds like a plan. It's a trebuchet deploy from deployment-bastion [15:52:34] there will be some hosts that don't complete (4/10 as I recall) [15:53:16] The prod scap deploy is quite a bit behind HEAD too. 
I pestered twentyafterfour a bit about that last night [15:53:34] (03Abandoned) 10Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/219465 (owner: 10Ottomata) [15:54:10] (03PS1) 10BryanDavis: [WIP] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 [15:54:26] ori: I fixed these in https://gerrit.wikimedia.org/r/#/c/244743/ [15:54:44] ah nice, reviewing [15:55:04] (03PS2) 10Greg Grossmeier: update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 (owner: 10Ori.livneh) [15:57:11] (03Abandoned) 10Ottomata: Increase critical threshold of varnishkafka drerr alert [puppet] - 10https://gerrit.wikimedia.org/r/219399 (owner: 10Ottomata) [15:59:00] bd808, are people contributing to this repository from phab and gerrit at the same time? [15:59:14] (03PS3) 10BryanDavis: update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 (https://phabricator.wikimedia.org/T115302) (owner: 10Ori.livneh) [15:59:22] (03CR) 10BryanDavis: [C: 032] update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 (https://phabricator.wikimedia.org/T115302) (owner: 10Ori.livneh) [15:59:27] Krenair: no idea [15:59:51] Krenair: bd808 to scap? yeah [16:01:30] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1721971 (10Andrew) Let's skip the Horizon box for now -- we can consolidate horizon services on the controller node or buy one later on. 
[16:01:49] Krenair: bd808 https://phabricator.wikimedia.org/differential/query/all/ [16:02:17] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1721985 (10Andrew) As discussed at off-site: Mark approves of this, thinks that reusing the 3 off-warranty boxes and the 2 almost-expired boxes won't c... [16:03:07] greg-g, how do the permissions for that repository work in phab? [16:03:45] how can we have both gerrit and phab working for a repo at the same time? [16:04:18] differential is just a code review tool. It's doesn't manage the repo [16:04:31] so once you approve in differential, you have to send it through gerrit? [16:04:51] I think so, yes. But I could be talking out of my ear [16:07:08] (03PS1) 10Ori.livneh: grafana2: users.auto_assign_org_role = Editor [puppet] - 10https://gerrit.wikimedia.org/r/245923 [16:07:10] (03Merged) 10jenkins-bot: update scap for dblists/* change [tools/scap] - 10https://gerrit.wikimedia.org/r/245917 (https://phabricator.wikimedia.org/T115302) (owner: 10Ori.livneh) [16:08:01] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana2: users.auto_assign_org_role = Editor [puppet] - 10https://gerrit.wikimedia.org/r/245923 (owner: 10Ori.livneh) [16:08:03] how do you typically test scap changes bd808? [16:08:08] !log restarting diamond on cp1052 in gdb in attempt to figure out why vanrishreqstats segfaults... [16:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:46] Krenair: oh, I think all the things they are doing in differential are on a branch [16:09:08] there's a bunch of commits clearly from differential on master [16:09:26] !log created education program tables for srwiki, T110619 [16:09:27] Krenair: I have a scabby testing vm and I also often cherry-pick to the beta cluster and try things there [16:09:46] ostriches, I thought I -2'd that? 
[16:09:58] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:10:37] (03PS1) 10Ori.livneh: misc-varnish: no caching of grafana requests [puppet] - 10https://gerrit.wikimedia.org/r/245926 [16:11:00] (03CR) 10Ori.livneh: [C: 032 V: 032] misc-varnish: no caching of grafana requests [puppet] - 10https://gerrit.wikimedia.org/r/245926 (owner: 10Ori.livneh) [16:12:57] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [16:12:59] Krenair: https://phabricator.wikimedia.org/T110619#1714416 [16:13:36] ostriches, yes, I saw that. [16:13:38] I did not remove my -2. [16:13:54] Please remove it then :) [16:14:06] !log restbase deploying a01c62a6 [16:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:24] Grr, why wasn't I logged? [16:14:37] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [16:14:45] !log created education program tables for srwiki, T110619 [16:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:05] tyvm [16:17:07] oh, there's a local patch on tin bd808 [16:17:47] ACKNOWLEDGEMENT - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 2 failures andrew bogott Andrew broke puppet. Bug T115347 [16:18:08] the check for content before I think that was needed for some sync-dir operation (extension update?) [16:22:40] 6operations: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [16:26:13] 6operations: Create an upload queue for reprepro - https://phabricator.wikimedia.org/T115349#1722053 (10MoritzMuehlenhoff) 3NEW [16:27:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[16:50:04] (back) [16:50:57] Krenair: Did you deploy the scap fix in the end? [16:51:28] not yet :/ [16:51:36] * James_F nods. [16:51:38] there's some uncommitted change on tin [16:51:45] I'm not sure what to do with it [16:52:43] * James_F nods. [16:54:09] (03PS1) 10Glaisher: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245945 (https://phabricator.wikimedia.org/T111335) [16:54:51] (03CR) 10Glaisher: "https://gerrit.wikimedia.org/r/#/c/245945/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [16:55:36] ori: unmerged patch on strontium you? [16:55:44] Krenair: Is it in production, or just on tin? [16:56:07] well tin is the main deployment server, so... [16:58:08] In that case, cherry-pick the uncommitted change on top of the new master? [17:00:53] are those tmh* hosts supposed to still be in the list of minions for scap? [17:02:44] Krenair: for the uncommitted change to scap I would commit locally then rebase on master then try to find out who made the hack and why. [17:03:04] my guess would be twentyafterfour needed it to sync-dir something [17:03:35] I don't think I made any scap hacks [17:04:46] Krenair: stash the change?
[17:06:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [17:10:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [17:10:20] 6operations, 6Analytics-Kanban, 10Traffic: Flag in x-analytics in varnish any request that comes with no cookies whatsoever - https://phabricator.wikimedia.org/T114370#1722291 (10Milimetric) [17:10:44] 6operations, 6Analytics-Kanban, 10Traffic: Flag in x-analytics in varnish any request that comes with no cookies whatsoever - https://phabricator.wikimedia.org/T114370#1722297 (10Nuria) https://gerrit.wikimedia.org/r/#/c/244626/ [17:12:10] 6operations, 6Analytics-Kanban, 10Traffic: Flag in x-analytics in varnish any request that comes with no cookies whatsoever [5 pts] - https://phabricator.wikimedia.org/T114370#1722305 (10Milimetric) [17:13:44] 6operations, 6Analytics-Kanban, 10Traffic: Flag in x-analytics in varnish any request that comes with no cookies whatsoever {bear} [5 pts] - https://phabricator.wikimedia.org/T114370#1722320 (10kevinator) [17:14:13] Is git.wikimedia.org down for you too? [17:15:17] 6operations, 10Traffic, 7Mobile, 5Patch-For-Review, 10reading-web-sprint-58-6: ml.wikipedia.org not redirecting to mobile site while accessing from a mobile device; many "Error: Module not found" errors - https://phabricator.wikimedia.org/T115191#1722327 (10Krenair) [17:22:37] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: Puppet has 1 failures [17:24:03] SPF|Cloud: yes [17:24:19] yay. 
[17:25:06] 6operations: git.wikimedia.org down - https://phabricator.wikimedia.org/T115363#1722386 (10Steinsplitter) 3NEW [17:26:50] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1722396 (10doctaxon) 5Resolved>3Open I'd like to join Project-Creators as TaxonBot. I will create any bots for commons, de.wp and other wi... [17:32:49] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1722432 (10Aklapper) 5Open>3Resolved (Task status is unrelated to requests in comments here) [17:34:07] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: puppet fail [17:37:45] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1722454 (10VBaranetsky) Great. Sorry for the delay. Doneva Daggett is our contact at Mark Monitor. I'll ask her to do that. Best, Vickie [17:39:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:42:00] !log performing canary deploy of a4c55e4 to restbase staging (xenon) [17:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:34] bd808, so I tried running the scap deployment [17:45:52] there's an error from salt... 
[17:47:56] PROBLEM - salt-minion processes on elastic1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:19] PROBLEM - salt-minion processes on mw1251 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:27] PROBLEM - salt-minion processes on db1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:38] PROBLEM - salt-minion processes on mw2034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:48] PROBLEM - salt-minion processes on mw1158 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:48] PROBLEM - salt-minion processes on cp2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:48:57] I'm guessing that's why. [17:49:10] Krenair: not much I can do about even helping debug salt problems. 
salt needs root help to figure out what's wrong in my experience [17:49:33] salt.exceptions.SaltClientError: Attempt to authenticate with the salt master failed [17:49:37] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:49:42] Repo: scap/scap [17:49:42] Tag: scap/scap-sync-20151013-174409 [17:49:46] 78/484 minions completed fetch [17:49:47] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:49:58] PROBLEM - salt-minion processes on mw2211 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:50:07] PROBLEM - salt-minion processes on mw2013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:50:07] PROBLEM - salt-minion processes on mw2083 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:52:57] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:54:54] !log performing deploy of a4c55e4 to restbase staging [17:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:38] RECOVERY - salt-minion processes on mw1251 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:57:08] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1722555 (10intracer) >>! In T706#1596692, @Ainali wrote: > I like to join #Project-Creators to be able to setup projects for Wikimedia Sverige... 
[17:57:17] RECOVERY - salt-minion processes on mw1158 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:57:17] RECOVERY - salt-minion processes on cp2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:58:06] RECOVERY - salt-minion processes on elastic1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:58:35] !log testing ldap integration in grafana 2 [17:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151013T1800). [18:00:27] RECOVERY - salt-minion processes on db1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:01:17] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:02:19] RECOVERY - salt-minion processes on mw2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:02:43] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1722564 (10Aklapper) >>! In T706#1722396, @doctaxon wrote: > I'd like to join Project-Creators as TaxonBot. I will create any bots for commons... [18:03:51] (03PS4) 10Nuria: Mark incoming requests without cookies in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/244626 (https://phabricator.wikimedia.org/T114370) [18:05:36] RECOVERY - salt-minion processes on mw2013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:06:14] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1722582 (10Aklapper) >>! 
In T706#1722555, @intracer wrote: > I'd want the same for Wikimedia Ukraine. I asked before for a project with subroj... [18:07:07] RECOVERY - salt-minion processes on mw2211 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:07:23] (03PS2) 10BBlack: Fix mobile direct for m.* langs - T115191 [puppet] - 10https://gerrit.wikimedia.org/r/245891 [18:08:13] (03CR) 10BBlack: [C: 032] "Tested w/ pcre" [puppet] - 10https://gerrit.wikimedia.org/r/245891 (owner: 10BBlack) [18:08:23] (03CR) 10BBlack: "Tested w/ pcre" [puppet] - 10https://gerrit.wikimedia.org/r/245891 (owner: 10BBlack) [18:08:57] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 8.86470346939 [18:09:43] !log restarted gitblit [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:10:47] (03PS5) 10Nuria: Mark incoming requests without cookies in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/244626 (https://phabricator.wikimedia.org/T114370) [18:11:19] !log started salt on mw2083 [18:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:37] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY:0.0129243877551 [18:12:08] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55044 bytes in 0.221 second response time [18:12:16] RECOVERY - salt-minion processes on mw2083 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:14:21] it works better now [18:14:24] 439/484 minions completed fetch [18:16:06] 6operations, 10ops-codfw: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1722630 (10Papaul) 3NEW a:3Papaul [18:25:56] Krenair: in theory the other host should catch up "eventually". 
There is puppet stuff that checks to see if the repo matches the deploy server [18:26:48] I don't know which changes were in the set, but often things that get changed in scap really only effect the deploy origin server itself (eg tin) [18:27:13] the work that is done on the target hosts is generally pretty simple [18:30:22] mw1259.eqiad.wmnet: [18:30:22] fetch status: None [started: 0 mins ago, last-return: None mins ago] [18:30:36] hosts in there which don't exist like tmh* [18:30:42] mw1260.eqiad.wmnet: [18:30:42] fetch status: None [started: 0 mins ago, last-return: None mins ago] [18:30:50] tin.eqiad.wmnet: [18:30:50] fetch status: 0 [started: 83 mins ago, last-return: 1240 mins ago] [18:30:50] mira.codfw.wmnet: [18:30:50] fetch status: 0 [started: 40 mins ago, last-return: 1240 mins ago] [18:30:55] various others [18:32:09] mw1260 was previously called tmh1002 [18:32:22] mw1259 was tmh1001 [18:33:44] bd808: I merged your vagrant patch earlier :) [18:34:01] got rid of the tmh* ones from redis [18:45:09] bd808, I'm kind of suspicious that all of the snapshot hosts and both of the newly reinstalled videoscalers are having issues with this [18:45:10] !log FYI i have puppet disabled on cp1052 while I try to figure out a diamond+VSL+multiprocessing bug [18:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:23] makes me wonder if something is wrong in puppet [18:46:34] (03PS1) 10Muehlenhoff: Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 [18:46:57] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1722718 (10Tgr) See [[ https://github.com/wikimedia/mediawiki/blob/a2d6ecc4539e60501803155990ec36575bdb4332/includes/filerepo/FileRepo.php#L1... 
[18:49:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:55] (03PS1) 10Muehlenhoff: Move base::firewall into the syslog role [puppet] - 10https://gerrit.wikimedia.org/r/245959 [18:53:37] I guess there might be something useful in /var/log/upstart/salt-minion.log but only root can read that [18:54:53] (03PS1) 10Muehlenhoff: Move base::firewall include into the racktables and rt roles [puppet] - 10https://gerrit.wikimedia.org/r/245962 [18:56:08] (03PS1) 1020after4: symlinks for 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245963 [18:56:49] (03CR) 1020after4: [C: 032] symlinks for 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245963 (owner: 1020after4) [18:56:55] (03Merged) 10jenkins-bot: symlinks for 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245963 (owner: 1020after4) [18:57:59] (03PS1) 10Muehlenhoff: Move base:firewall include into the memcached role [puppet] - 10https://gerrit.wikimedia.org/r/245964 [18:58:49] !log twentyafterfour@tin Started scap: Full scap sync for 1.27.0-wmf.3 [18:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:33] (03PS1) 10Muehlenhoff: Move base::firewall include into the otrs role [puppet] - 10https://gerrit.wikimedia.org/r/245965 [19:04:17] * Krenair gives up [19:04:20] (03PS1) 10Muehlenhoff: Move base::firewall includes for roles on krypton [puppet] - 10https://gerrit.wikimedia.org/r/245968 [19:05:11] it was at 467/482 minions completed fetch [19:06:22] put the repo on tin back how I found it [19:08:27] 482 minions? :-) [19:08:41] That's quite a following. [19:09:02] (03PS1) 10Muehlenhoff: Move base::firewall include into the mx role [puppet] - 10https://gerrit.wikimedia.org/r/245969 [19:10:13] If only the whole group responded... [19:11:24] Surely you are not suggesting that salts may not be entirely reliable? 
[19:11:32] salt* [19:13:10] (03PS1) 10Muehlenhoff: Include base::firewall into the planet role [puppet] - 10https://gerrit.wikimedia.org/r/245970 [19:13:18] I was kind of expecting the stuff ops used to manage production servers to be reliable. [19:13:33] I've only really used salt a bit before, in the deployment-prep labs project, so... [19:13:44] (03PS1) 10Ori.livneh: grafana2: don't allow users to create orgs or accounts [puppet] - 10https://gerrit.wikimedia.org/r/245971 [19:14:08] (03PS1) 10Muehlenhoff: Move base::firewall include into the openldap::corp role [puppet] - 10https://gerrit.wikimedia.org/r/245972 [19:14:20] (03PS2) 10Ori.livneh: grafana2: don't allow users to create orgs or accounts [puppet] - 10https://gerrit.wikimedia.org/r/245971 [19:14:28] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana2: don't allow users to create orgs or accounts [puppet] - 10https://gerrit.wikimedia.org/r/245971 (owner: 10Ori.livneh) [19:15:49] Krenair: Oh, we hate salt too. :-) [19:18:48] fwiw I think iscap could be a viable replacement for remote command execution [19:19:09] (03PS1) 10Muehlenhoff: Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974 [19:19:48] not really [19:20:04] ? 
[19:20:06] i don't love salt, but a viable replacement would have to provide the ability to use queries to select hosts [19:20:16] based on some attribute / role / property [19:20:25] ori: that's the part that I'm still working on [19:20:42] mainly it needs a source for the data that would be queried [19:20:44] (03PS1) 10Muehlenhoff: Move base::firewall include into the gerrit::production role [puppet] - 10https://gerrit.wikimedia.org/r/245975 [19:20:50] salt uses facter [19:20:58] see the output of 'facter -p' for any server [19:20:59] <_joe_> ori: no [19:21:10] but this seems kinda crazy, tbh [19:21:10] <_joe_> ori: salt uses grains we collect via puppet [19:21:17] _joe_: oh, right [19:21:19] <_joe_> it's kinda awkward [19:21:30] salt used to use facter [19:21:31] would etcd have some of the same info? [19:21:50] probably not [19:22:19] <_joe_> no. [19:22:20] iscap wouldn't be able to query the individual machines so it would need some central db to query [19:22:31] <_joe_> twentyafterfour: for doing what? [19:22:37] (I mean, it could connect to each host just to ask it what it's facts are but that would be hella slow) [19:22:45] <_joe_> twentyafterfour: this is a thing you should discuss with us [19:23:03] * ori cues tom waits's "what's he building in there?" [19:23:06] _joe_: for remote command execution. This is just an idea in my head right now, but yeah, I'm discussing it with you now ;) [19:23:23] <_joe_> twentyafterfour: I can't atm, sorry [19:23:24] https://www.youtube.com/watch?v=JaLjwSpZ6Cs [19:23:26] <_joe_> :) [19:24:23] _joe_: I'd love to talk to you about it, whenever you do have time. 
[19:25:11] <_joe_> twentyafterfour: heh, probably next week, we're at the ops offsite [19:25:23] oh right, I forgot about that [19:25:34] well have fun :) [19:25:38] <_joe_> but yeah, we should [19:25:54] <_joe_> twentyafterfour: we're talking workflow metrics atm [19:30:16] !log twentyafterfour@tin Finished scap: Full scap sync for 1.27.0-wmf.3 (duration: 31m 26s) [19:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:35:17] Krenair, bd808: Still no luck deploying that scap change? :-( [19:35:25] no, I gave up [19:35:47] some servers don't seem to be dealing with the deployment system properly [19:35:52] Or at all? [19:35:58] that too [19:36:12] :-( [19:36:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [19:38:25] !log delete wikidata-l mailing list (archives still accessible) [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:42] time to make scap deploy scap. Or package it. Probably the latter. [19:40:12] twentyafterfour: scap scap -- nice name :P [19:40:27] twentyafterfour: I think that's a "yo dawg" solution if I ever heard one :) [19:41:01] package is? 
meh [19:41:12] that's going to have a lot of overhead for little benefit [19:42:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [19:47:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [19:48:16] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245981 [19:49:09] paravoid: T114981 [19:49:38] (03PS1) 10Ori.livneh: Add grafana-admin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/245983 [19:50:10] twentyafterfour: well I filed that didn't I :) [19:50:34] well that's the reasoning for packaging it ;) [19:51:36] that's not very different than saying "only root can deploy scap" (either with trebuchet or scap itself) [19:51:46] it has the same detrimental effect essentially [19:52:14] paravoid: yeah and we haven't been able to think of anything that doesn't essentially have the same result [19:52:42] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245981 (owner: 1020after4) [19:52:48] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245981 (owner: 1020after4) [19:52:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [19:58:31] paravoid: git.wm.o could use with a kick on antimony (seemingly as you're the latest opsen around :) ) [19:58:38] ref. https://phabricator.wikimedia.org/T115363 [19:59:01] I'm not gonna do that [19:59:24] this has been broken for almost two? 
years now [20:00:06] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.3 [20:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:19] yes, but on the otherhand people do rely on it and ops have no said anything against no longer supporting it/provide an alternative actively besides diffusion which is seemingly not ready yet [20:00:25] git.wm.o should finally be killed very soon [20:00:43] "we are no longer supporting it" [20:00:46] is that enough? :) [20:00:48] lol :D [20:01:23] shouldn't that happen when the thing it working? :) [20:01:26] PROBLEM - Host mw1157 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:26] *is [20:01:28] and ops didn't set it up in the first place so it doesn't fall on us to provide an alternative [20:01:38] 6operations: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722961 (10Dzahn) Starting with some things doing this that we might want to change: NTP -> modules/ntp/ ; package { 'ntp': ensure => latest } Puppet -> modules/base/ ; package { [ 'puppet', 'facter' ]:.. Interface -> modul... [20:01:44] diffusion is ready, fwiw [20:02:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [20:02:17] what are we waiting for then? links to be updated? [20:02:37] a few repos might still be missing but other than redirecting the old URLs to phabricator, I think all the serious work is done on the gitblit-deprecate project [20:02:59] Krenair: we wanted to provide URL redirects which haven't been implemented yet [20:03:37] hi ops [20:03:59] in case you don't know already i can't log in to mediawiki.org -- (Cannot access the database: Unknown database 'mediawikiwiki' (10.64.16.28)) [20:04:31] ewk [20:04:34] looking [20:04:40] eeeek [20:05:06] cwdent: thanks for bringing this to attention. 
[20:05:16] checking [20:05:17] np! [20:05:20] I just deployed the train to mediawiki.org so it's gotta be something with the new branch [20:05:28] (well, most likely) [20:05:38] also, did dblists get moved recently? [20:05:49] that might be the cause [20:05:52] krenair@terbium:~$ sql mediawikiwiki -h db1039 [20:05:52] ERROR 1049 (42000): Unknown database 'mediawikiwiki' [20:05:53] its about to be moved, not sure that's gone through yet [20:05:55] * apergos checks [20:06:15] db1039 is an s7 slave [20:06:43] but mediawikiwiki is not supposed to be in s7 [20:06:49] ori: moved dblists earleir [20:06:51] https://gerrit.wikimedia.org/r/#/c/244743/ not yet merged [20:06:53] *earlier [20:06:53] it's s3 [20:06:57] oh? [20:06:59] yes [20:07:03] yeah what Krenair said [20:07:14] I saw something get merged yesterday(?) [20:07:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [20:08:07] nvm me, I can't find it? [20:08:13] umm [20:08:24] oh [20:08:26] oh, nevermind. [20:08:26] https://github.com/wikimedia/operations-mediawiki-config/commit/292f4e748e5283a52f086c0ea45b382aef5f4035 [20:08:28] and similar [20:08:30] the actual location of the dblist files was moved even if the other stuff didn't merge yet [20:08:54] ^^ [20:09:17] /srv/mediawiki/dblists that's where they got moved to indeed [20:09:24] having just hopped on a random app server [20:09:54] yeah I had to change my deployment script... so is mediawiki still looking for it in the root? [20:10:16] i don't think that's it [20:10:28] it worked fine on testwiki at first but now testwiki is broken too: https://test.wikipedia.org/wiki/Main_Page [20:10:44] ori: it looks like that's it to me [20:11:06] where is the code that is looking for the dblist in the wrong place? [20:11:47] wmf-config/db-eqiad.php [20:11:51] on mediawiki-config [20:12:26] but it is ok there [20:13:01] <_joe_> so both servers are on s7? 
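Editor's sketch, not part of the log: the exchange above turns on which database section (s3 or s7) a wiki belongs to, and on the dblist files having moved to /srv/mediawiki/dblists. Assuming each section has a file such as s3.dblist in that directory with one database name per line (the file naming and layout are assumptions here, and find_section is a hypothetical helper, not an existing tool), a lookup could look roughly like this:

```python
from pathlib import Path
from typing import Optional


def find_section(dblist_dir: str, dbname: str) -> Optional[str]:
    """Return the stem (e.g. "s3") of the first sN.dblist file listing dbname.

    Assumes one database name per line; anything after '#' is treated
    as a comment. Returns None when no section file lists the name.
    """
    for path in sorted(Path(dblist_dir).glob("s[0-9]*.dblist")):
        for line in path.read_text().splitlines():
            entry = line.split("#", 1)[0].strip()
            if entry == dbname:
                return path.stem
    return None
```

Under those assumptions, find_section("/srv/mediawiki/dblists", "mediawikiwiki") would be expected to report s3, which matches what Krenair says in the log; a connection attempt against an s7 host would then point at something other than the wiki's own section lookup.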
[20:13:02] oh, ignore me [20:13:10] <_joe_> jynus: is that correct? [20:13:13] _joe_, both are in s3 [20:13:24] but s7 is tried [20:13:31] <_joe_> ok [20:13:55] [tin:~] $ mwscript eval.php mediawikiwiki [20:13:55] > echo wfGetLB()->getServerName(0); [20:13:55] db1038 [20:13:58] that's correct, no? [20:14:11] i think this might be centralauth [20:14:11] ori, yes [20:14:15] that's the s3 master, yes [20:14:24] probably. centralauth would be connecting to s7 [20:14:30] correct [20:14:40] <_joe_> are both the wikis in group0? [20:14:53] are testwiki/test2wiki/zerowiki/testwikidatawiki affected? [20:14:57] but the actual error is Unknown database 'mediawikiwiki' [20:15:11] _joe_: yes group0 [20:15:15] testwiki is [20:15:24] If CA was selecting the wrong DB host, it wouldn't be caught in dev or beta [20:15:25] Krenair: I don't know about zerowiki ... [20:15:33] or wrong DB name [20:15:38] <_joe_> so shouldn't we revert the last scap? [20:15:44] yes [20:15:48] +1 [20:15:51] <_joe_> please do. [20:16:01] https://gerrit.wikimedia.org/r/#/c/205528/ and https://gerrit.wikimedia.org/r/#/c/232322/ would be pretty suspcious [20:16:04] btu neither has been merged [20:16:11] (03PS1) 10Merlijn van Deen: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) [20:16:26] (03PS2) 10Merlijn van Deen: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) [20:16:32] <_joe_> ori: we've just promoted group 0 to a new mediawiki version, that sounds suspicious [20:16:33] (03PS3) 10Merlijn van Deen: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) [20:16:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [20:16:50] _joe_: I can roll it back but won't that make 
it hard to debug? [20:16:53] this happens since exactly the last sync_wikiversions [20:16:55] I can roll back mediawiki and leave testwiki [20:16:56] mediawiki.org has been slow, then I got "(Cannot access the database: Unknown database 'mediawikiwiki' (10.64.48.15))". I'm sure top minds are on it, thanks! [20:16:59] (03PS1) 10Ori.livneh: Revert "group0 wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246073 [20:17:02] <_joe_> twentyafterfour: mediawiki.org is now down [20:17:07] (03CR) 10Ori.livneh: [C: 032] Revert "group0 wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246073 (owner: 10Ori.livneh) [20:17:10] fine, leave testwiki [20:17:10] <_joe_> we can't really "debug" that [20:17:13] (03CR) 10Merlijn van Deen: "Tested on toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) (owner: 10Merlijn van Deen) [20:17:15] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246073 (owner: 10Ori.livneh) [20:17:16] _joe_: right... [20:18:09] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246074 [20:18:11] <_joe_> mediawiki.org is back? [20:18:19] yeah my sync-wikiversions is at 99% [20:18:21] stuck on two hosts [20:18:22] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246074 (owner: 1020after4) [20:18:23] <_joe_> twentyafterfour: ori did it I think [20:18:25] thanks ori [20:18:28] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246074 (owner: 1020after4) [20:18:34] <_joe_> ori: one is an imagescaler and it's just dead [20:18:37] error rate went down [20:18:38] heh, conflicting reverts? 
[20:18:41] !log ori@tin rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [20:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:06] looks good [20:19:39] let's just set all group0 wikis to 1.27.0-wmf.3 on mw1017 only [20:19:45] that way we can debug all wikis using x-wikimedia-debug [20:19:56] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.2 [20:19:57] +1 [20:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 1 below the confidence bounds [20:20:21] <_joe_> that sounds a bit messy, but as long as noone is releasing anything that might work [20:21:03] ssh: connect to host mw1157.eqiad.wmnet port 22: Connection timed out [20:21:05] mw1017 is often used like this [20:21:23] <_joe_> twentyafterfour: yeah that's what I was saying earlier [20:21:38] <_joe_> it is down, still need to connect to its mgmt to bring it up [20:22:04] !log locally hacked testwiki and mediawikiwiki to point to php-1.27.0-wmf.3 on mw1017 [20:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:11] is this then contained, then? 
[20:22:16] yes [20:22:30] thanks to the original reporter [20:22:38] (cant find the name) [20:23:36] going to check back out (meeting in progress) [20:23:53] (03CR) 10Yuvipanda: [C: 031] toollabs: add composer to dev hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) (owner: 10Merlijn van Deen) [20:24:05] jynus: it was cwdent [20:26:43] yeah, so, it's centralauth [20:26:44] https://dpaste.de/3pKT/raw [20:26:46] Getting it consistently on testwiki, though [20:26:57] 6operations, 10Gitblit: git.wikimedia.org down - https://phabricator.wikimedia.org/T115363#1723127 (10Aklapper) gitblit was restarted earlier already in https://wikitech.wikimedia.org/wiki/Server_Admin_Log but seems to fail again. [20:27:04] jynus: because !log locally hacked testwiki and mediawikiwiki to point to php-1.27.0-wmf.3 on mw1017 [20:27:13] all testwiki reqs go to mw1017 [20:27:15] <_joe_> !log rebooting mw1157, stuck in a kernel soft lockup [20:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:20] ok, sorry, I though that was only a percentage [20:27:25] sorry about that [20:28:31] (03PS4) 10Yuvipanda: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) (owner: 10Merlijn van Deen) [20:29:26] RECOVERY - Host mw1157 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [20:29:40] <_joe_> I'll run sync-common on it [20:36:15] ori: I don't see any changes in CentralAuth that could cause this. Did you find any more clues? [20:37:47] might be in core LoadBalancer code? 
[20:37:57] kinda suspecting I560ebd19c4eb2b3a040d4331702346440617cfaa [20:37:58] yeah [20:38:46] (03PS1) 10Andrew Bogott: service_unit: Add a service_running arg [puppet] - 10https://gerrit.wikimedia.org/r/246081 (https://phabricator.wikimedia.org/T115347) [20:39:12] yep [20:39:15] yeah I saw that before [20:39:19] that was it [20:39:25] but disregarded it :/ [20:39:29] or discarded it I guess [20:39:31] if i comment out "# Make master connections read only if in lagged slave mode" everything is back [20:39:49] dammit I was there 15' ago [20:40:01] Jeff_Green: we can hear you drumming ;] [20:40:15] ori moved the dblists [20:40:20] it was probably ori, ori ori ori [20:40:26] I never said that [20:40:28] !log restart gitblit (once more) [20:40:29] fwiw [20:40:31] i know, i'm just kidding [20:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:45] I discarded that early on too, since the mediawiki-config changes were in before the group0 update [20:40:51] and they are still in after the revert :) [20:41:01] if we learn't anything from stargate, it was to blame ori <3 [20:41:02] "This also catches foreign DBs which might slip through the cracks." 
[20:41:08] yes [20:41:10] it certainly caught them [20:41:51] ori: ori ori it must be ori ;) [20:42:15] well, aaron has been doing some heroic work with the database code [20:42:19] It's not clear from the code what went wrong though [20:42:27] on the whole it has been pretty smooth sailing [20:42:32] yeah me neither, that's why I discarded it before [20:42:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55048 bytes in 0.219 second response time [20:42:39] i think this is the group0 thing working as it should [20:43:01] heh good point [20:43:15] (03PS1) 10Ottomata: Only use TcpConnStates diamond collector on parsoid and parsoid varnish hosts [puppet] - 10https://gerrit.wikimedia.org/r/246084 [20:43:20] although we should be a bit faster on the revert when it goes wrong [20:43:26] ori ^^ [20:43:39] set setLBInfo overwrite everything or something? :| [20:43:44] does setLBInfo* [20:44:31] doesn't look like it [20:45:29] <_joe_> ori: http://i.imgur.com/HNkdgfZ.jpg [20:45:35] heh [20:45:50] it isn't great that Beta Cluster wasn't updating since Oct 11th due to the dblist change, though :/ [20:45:50] Do you guys have a fix yet? [20:45:52] aaron has been asking for CR of some CentralAuth patches, maybe he wasn't expecting them to be merged out of sequence [20:45:54] !log ori@tin Synchronized php-1.27.0-wmf.3/autoload.php: (no message) (duration: 01m 14s) [20:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:00] greg-g: haha gotta get your fix [20:46:15] * greg-g is sporadically reading scrollback ;) [20:46:19] (03CR) 10Hashar: [C: 031] "Note that if you ever need to bump the composer version in integration/composer.git , that has a huge impact on CI." [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) (owner: 10Merlijn van Deen) [20:46:23] (03CR) 10Ori.livneh: [C: 031] "Nice find." 
[puppet] - 10https://gerrit.wikimedia.org/r/246084 (owner: 10Ottomata) [20:46:32] * greg-g is still trying to figure out the Beta Cluster issue [20:46:40] 6operations, 10Gitblit: git.wikimedia.org down - https://phabricator.wikimedia.org/T115363#1723237 (10akosiaris) And restarted again. [20:46:46] it seems code is updated, but special:version isn't showing an updated date [20:47:06] hoo: yes [20:47:12] (03CR) 10Ottomata: [C: 032] Only use TcpConnStates diamond collector on parsoid and parsoid varnish hosts [puppet] - 10https://gerrit.wikimedia.org/r/246084 (owner: 10Ottomata) [20:47:16] (03PS1) 10Andrew Bogott: Set service_running => false for mdns and pool manager on spare designate server. [puppet] - 10https://gerrit.wikimedia.org/r/246089 (https://phabricator.wikimedia.org/T115347) [20:47:24] Ok, cool [20:47:33] Will not commit then [20:47:45] twentyafterfour: seems safe to try rolling out wmf3 again [20:47:50] it looks good on testwiki [20:47:59] !log ori@tin Synchronized php-1.27.0-wmf.3/includes/db: I0e5f2d3b2: Revert Enforce lagged-slave read-only mode on the DB layer (duration: 01m 14s) [20:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:08] ori: ok [20:48:33] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246090 [20:48:57] ori: Wait... you jsut reverted it? [20:49:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [20:49:21] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246090 (owner: 1020after4) [20:49:27] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246090 (owner: 1020after4) [20:49:39] hoo: looks like he reverted it indeed [20:49:48] mh... do you have a proper fix? [20:49:49] y..yes? 
[20:49:53] I'll just upload mine [20:49:54] it isn't great that Beta Cluster wasn't updating since Oct 11th due to the dblist change, though :/ [20:49:58] Would this have been caught in beta? [20:50:07] Do we have a ticket? [20:50:11] Krenair: No [20:50:28] It only uses two DB servers, both have all the databases... right? [20:50:30] Krenair: no [20:50:44] it was also merged on oct 6th [20:51:39] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.3 [20:52:01] me right now: http://i.imgur.com/0DTpPHh.jpg [20:52:06] (03PS1) 10Ottomata: Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 [20:52:25] (03PS2) 10Andrew Bogott: service_unit: Add a service_running arg [puppet] - 10https://gerrit.wikimedia.org/r/246081 (https://phabricator.wikimedia.org/T115347) [20:52:34] (03CR) 10Ori.livneh: [C: 031] Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [20:52:43] (03CR) 10jenkins-bot: [V: 04-1] Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [20:53:13] I'm confused about the state [20:53:21] but here's my patch and why that's aproblem: https://gerrit.wikimedia.org/r/246092 [20:53:36] let aaron review it please [20:53:50] ok [20:54:10] (03PS5) 10Merlijn van Deen: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) [20:54:20] (03PS2) 10Ottomata: Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 [20:54:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [20:54:40] if we 
just don't change anything nothing will break [20:55:23] (03CR) 10jenkins-bot: [V: 04-1] Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [20:57:02] (03PS3) 10Ottomata: Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 [20:57:49] (03CR) 10jenkins-bot: [V: 04-1] Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [20:58:43] ottomata: maybe simpler to just salt it away? [20:59:34] on all hosts? [20:59:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [20:59:37] ERrrrr [20:59:44] (03PS3) 10EBernhardson: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 [20:59:46] (03PS4) 10Ottomata: Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 [20:59:52] dunno why puppet no like me [21:00:26] ottomata: might as well [21:00:31] it'll get reprovisioned where it is needed [21:00:38] (03CR) 10jenkins-bot: [V: 04-1] Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [21:00:45] yeah but i dont' trust that salt will actually do it! [21:00:49] (03CR) 10Yuvipanda: "Can't this be passed in as part of $service_params?" [puppet] - 10https://gerrit.wikimedia.org/r/246081 (https://phabricator.wikimedia.org/T115347) (owner: 10Andrew Bogott) [21:00:58] heh, might as well try. 
[21:02:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:15] (03PS6) 10Yuvipanda: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) (owner: 10Merlijn van Deen) [21:03:57] ahhh, ori, chase tells me that /etc/diamond/collectors is a fully puppet amnaged dir, so puppet should do the right thing [21:04:01] and purge theconf file [21:04:10] oh right [21:04:12] which is good enough for me [21:04:14] 6operations, 10Traffic, 7Mobile, 5Patch-For-Review, 10reading-web-sprint-58-6: ml.wikipedia.org not redirecting to mobile site while accessing from a mobile device; many "Error: Module not found" errors - https://phabricator.wikimedia.org/T115191#1723347 (10Krenair) 5Open>3Resolved a:3Krenair [21:04:25] (03Abandoned) 10Ottomata: Need to ensure that TcpConnStates diamond collector is absent on non parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/246091 (owner: 10Ottomata) [21:04:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [21:05:12] !log reenabling puppet on cp1052 [21:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:53] (03CR) 10BryanDavis: [C: 031] Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 (owner: 10EBernhardson) [21:08:31] (03PS1) 10Ori.livneh: provide a public, read-only view of grafana [puppet] - 10https://gerrit.wikimedia.org/r/246096 [21:08:46] paravoid: ^ [21:09:29] having grafana-test is ugly; i think we can decom it in a couple of days and 'promote' the read-only view of grafana2 to be the new grafana.wikimedia.org [21:09:49] once we do that i'll add a comment to the manifest explain who's who [21:18:58] (03CR) 10MaxSem: [C: 031] Log messages sent to the 'warning' channel [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/244832 (owner: 10EBernhardson) [21:29:07] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: puppet fail [21:36:03] (03CR) 10Alex Monk: [C: 031] Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245945 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [21:39:18] Hi, I used to be able to ssh into stat1003.eqiad.wmnet to look at Flow data but now I can't. ( Permission denied (publickey). ). Anyone can help? [21:39:55] stephanebisson: what is your shell username? [21:40:06] JohnFLewis: sbisson [21:42:19] stephanebisson: are you able to ssh to a bastion? (bast1001.wikimedia.org e.g.) [21:43:45] JohnFLewis: I can when I specify my username and key on the command line (my ssh/config is probably wrong here) [21:44:08] but the same options for stat1003 doesn't fix it [21:44:19] stephanebisson: would be a good start :) can you post you config somewhere? [21:44:57] Oh, aha, then maybe your ProxyCommand setup is wrong, or the Host stanzas are [21:46:02] JohnFLewis: https://phabricator.wikimedia.org/P2192 [21:47:54] 6operations, 7Database: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#1723563 (10Neil_P._Quinn_WMF) [21:47:59] Yeah you need to add a user line [21:48:03] to the bast1001 block [21:48:36] stephanebisson, https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-PSTE-tx3kszm5pkhpvis/ [21:49:50] Krenair: doesn't seem to work [21:50:25] Okay. Can you `ssh bast1001.wikimedia.org` without anything extra? [21:50:51] oh, other thing [21:50:56] "IdentityFile ~/.ssh/id_rsa_production" this line also needs to apply to the bastion [21:51:37] Krenair: success for bast1001.wikimedia.org [21:52:03] but not stat1003? [21:52:19] as well, thanks! [21:52:37] the IdentityFile addition to the bastion host fixed it? 
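Editor's sketch, not part of the log: the fix discussed above was to make the User and IdentityFile settings apply to the bastion host as well as to the internal host reached through it. The actual config in P2192 is not reproduced in the log, so the Host pattern *.eqiad.wmnet, the ProxyCommand line, and the helper render_ssh_config below are assumptions for illustration only. The sketch simply renders an ssh_config text in which both stanzas carry the same User and IdentityFile:

```python
def render_ssh_config(user: str, identity_file: str,
                      bastion: str, internal_pattern: str) -> str:
    """Build ssh_config text where the bastion stanza and the internal-host
    stanza share the same User and IdentityFile settings."""
    common = [f"    User {user}", f"    IdentityFile {identity_file}"]
    bastion_block = [f"Host {bastion}"] + common
    # Internal hosts are reached by tunnelling through the bastion.
    internal_block = (
        [f"Host {internal_pattern}"]
        + common
        + [f"    ProxyCommand ssh -W %h:%p {bastion}"]
    )
    return "\n".join(bastion_block + [""] + internal_block) + "\n"


config = render_ssh_config(
    "sbisson",
    "~/.ssh/id_rsa_production",
    "bast1001.wikimedia.org",
    "*.eqiad.wmnet",  # assumed pattern covering stat1003.eqiad.wmnet
)
print(config)
```

Keeping the shared settings in one list makes it harder to end up with the situation in the log, where the key and username were set for the internal host but not for the bastion that the connection passes through.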
[21:53:22] (03PS2) 10Andrew Bogott: Ensure mdns and pool-manager services stopped on backup designate server. [puppet] - 10https://gerrit.wikimedia.org/r/246089 (https://phabricator.wikimedia.org/T115347) [21:53:40] (03Abandoned) 10Andrew Bogott: service_unit: Add a service_running arg [puppet] - 10https://gerrit.wikimedia.org/r/246081 (https://phabricator.wikimedia.org/T115347) (owner: 10Andrew Bogott) [21:53:46] I guess, the User and IdentityFile are the only changes [21:54:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:55:19] (03PS3) 10Andrew Bogott: Ensure mdns and pool-manager services stopped on backup designate server. [puppet] - 10https://gerrit.wikimedia.org/r/246089 (https://phabricator.wikimedia.org/T115347) [21:56:23] (03CR) 10Andrew Bogott: [C: 032] Ensure mdns and pool-manager services stopped on backup designate server. [puppet] - 10https://gerrit.wikimedia.org/r/246089 (https://phabricator.wikimedia.org/T115347) (owner: 10Andrew Bogott) [21:57:58] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:58:37] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:01:11] ori AaronSchulz https://i.imgflip.com/shxz3.jpg :) [22:18:01] 6operations, 10Internet-Archive, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1723675 (10Sadads) [22:23:26] Krenair: ori :) yeah, I was just jokingly piling on ori (mostly joking, also thinking "grr, we really shouldn't deploy to prod when Beta hasn't been updating for a few days successfully) [22:23:45] [22:30:09] 6operations: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723723 (10bd808) 3NEW [22:50:13] 6operations, 7Mail: Google Mail marking Phabricator 
and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723787 (10greg) [22:52:17] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723802 (10greg) p:5Triage>3High Ironically, the messages from this task are being marked as spam as well. See below, my spam folder is filled more with Phab email than rea... [22:55:29] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723815 (10greg) Btw, I just got this when in gmail's web UI to mark them not as spam: {F2717835} Anyone else? [22:58:14] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723824 (10JKrauska) I am not certain the 'lack of identity' is what's causing these emails to be marked as spam. What exactly do you mean by that? Cheers, Joel [22:59:38] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723838 (10greg) >>! In T115416#1723824, @JKrauska wrote: > I am not certain the 'lack of identity' is what's causing these emails to be marked as spam. > > What exactly do yo... [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151013T2300). Please do the needful. [23:00:04] RoanKattouw ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:33] i suppose i can push it all out [23:00:35] RoanKattouw: around? 
[23:00:47] (03CR) 10EBernhardson: [C: 032] Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 (owner: 10EBernhardson) [23:00:49] I have something to add [23:01:10] (03Merged) 10jenkins-bot: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 (owner: 10EBernhardson) [23:01:57] ebernhardson, Yup I'm here [23:02:16] I want to add one more patch but I'm still waiting for it to merge [23:02:40] Jenkins is producing bogus failures more often than not in the Flow repo and we don't yet know why [23:02:43] going to be a busy swat :) [23:02:51] Oh look it merged [23:02:52] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Turn on logging of the "warning" channel (duration: 01m 13s) [23:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:57] Now I just need to cherry-pick it by hand :S [23:03:05] 7Puppet, 6Labs: dynamicproxy: Move list of blocked user agents to hiera - https://phabricator.wikimedia.org/T90844#1723863 (10Krenair) a:3Krenair [23:03:19] ebernhardson, Is it OK if I start merging my own cherry-picks already? [23:03:37] I have to force-merge them because Jenkins, and I think I need one of them to merge before I create the other cherry-pick [23:03:55] RoanKattouw: hmm, i think it should be fine, even if we end up pulling them to tin before ready for deploy can just not sync the dir. 
[23:04:02] Right
[23:04:57] (PS1) Alex Monk: dynamicproxy: Make blocked user agents configurable [puppet] - https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844)
[23:05:47] ebernhardson, OK, done, including the extra one
[23:08:11] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch/: Update cirrus for common terms AB tes (duration: 01m 15s)
[23:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:10:28] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/: Update WME for common terms AB test (duration: 01m 14s)
[23:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:11:17] RoanKattouw: looks like flow in wmf-3 is ready for sync?
[23:11:21] Yes
[23:11:22] And wmf-2 too
[23:11:28] kk
[23:13:05] and hes gone :)
[23:13:31] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/Flow: Bump flow submodule in 1.27.0-wmf.3 (duration: 01m 15s)
[23:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:15:24] RoanKattouw: wmf.3 is out, pushing .2 in a moment
[23:15:30] Yup, checking wmf3 now
[23:16:19] Is it just me or has wikitech started losing session data frequently again?
[23:16:47] We had reports yesterday
[23:16:53] andrewbogott said it's nothing to worry about
[23:17:20] Yeah I get logged out of wikitech every 15 mins or so
[23:17:33] I have to log in again every time I want to edit the deployments page
[23:17:33] Now I can't save
[23:17:37] Logged out, can't log back in again
[23:17:39] ebernhardson, wmf.3 looks good
[23:19:26] yep, wikitech is broken
[23:19:46] RoanKattouw: i think you broke jenkins :(
[23:19:52] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/Flow: Make flow board descriptions editable again (duration: 01m 16s)
[23:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:20:43] https://gerrit.wikimedia.org/r/#/c/246076/ is merged (via force-merge) but https://integration.wikimedia.org/zuul/ shows the gate-and-submit pipeline blocked on it
[23:20:50] i'll just force merge these others, but someone has to kick something :)
[23:20:56] sigh wtf
[23:20:57] !log restarted apache on silver
[23:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:16] Yeah it's stuck on one of my force-merges
[23:21:16] logged in successfully now
[23:21:22] But two of the others made it through OK
[23:22:06] saved successfully too
[23:22:24] argh
[23:22:27] it just failed again
[23:24:21] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1723937 (bd808) All replies to this task have ended up in gmail's spam folder for me as well: {F2717875}
[23:25:20] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Add information for common terms ab test (duration: 01m 15s)
[23:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:12] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Turn on cirrus common terms test (duration: 01m 15s)
[23:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:28] Krenair: i'm guessing its intentional (since no $lang exists), but its supposed to be in single quotes? https://gerrit.wikimedia.org/r/#/c/245945/1/wmf-config/InitialiseSettings.php
[23:27:48] yes
[23:28:09] (CR) EBernhardson: [C: 2] Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/245945 (https://phabricator.wikimedia.org/T111335) (owner: Glaisher)
[23:28:31] (Merged) jenkins-bot: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/245945 (https://phabricator.wikimedia.org/T111335) (owner: Glaisher)
[23:30:24] (CR) Ori.livneh: [C: 2] Add grafana-admin.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/245983 (owner: Ori.livneh)
[23:30:38] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default (duration: 01m 14s)
[23:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:49] Krenair: ^
[23:32:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55003 bytes in 0.500 second response time
[23:34:17] (PS2) Ori.livneh: provide a public, read-only view of grafana [puppet] - https://gerrit.wikimedia.org/r/246096
[23:34:27] that should conclude swat, just waiting on the OK from krenair
[23:37:08] (PS3) Ori.livneh: provide a public, read-only view of grafana [puppet] - https://gerrit.wikimedia.org/r/246096
[23:37:27] (CR) Ori.livneh: [C: 2 V: 2] provide a public, read-only view of grafana [puppet] - https://gerrit.wikimedia.org/r/246096 (owner: Ori.livneh)
[23:37:45] ebernhardson, looks good
[23:37:57] thanks
[23:38:14] sweet
[23:47:58] (CR) Tim Landscheidt: "For non-confusion, I'd prefer if we stick to the "detailed" error message for requests with no User-Agent. I have some gruesome memories " [puppet] - https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844) (owner: Alex Monk)
[23:50:56] operations, Gitblit: git.wikimedia.org down - https://phabricator.wikimedia.org/T115363#1724005 (greg)
[23:50:59] operations, Gitblit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1724006 (greg)
[23:58:47] (PS2) Alex Monk: dynamicproxy: Make blocked user agents configurable [puppet] - https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844)