[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170224T0000). Please do the needful. [00:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:12] Present [00:02:12] I can SWAT [00:03:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [00:04:59] (03Merged) 10jenkins-bot: Store goodfaith scores in the ORES tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [00:07:11] matt_flaschen: patch is on mwdebug1002, check please [00:07:13] (03CR) 10jenkins-bot: Store goodfaith scores in the ORES tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [00:09:58] (03PS4) 10Volans: Cumin: authorize also cumin masters IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) [00:17:21] !log restbase deploying b477ab46 [00:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:15] thcipriani, I don't think I can properly test it on mwdebug1002. I made an edit (which goes into the classification for the *old* model), but it spawns a job which is not run on mwdebug1002. [00:22:38] Sorry, it took me a bit of time to figure out how to try to test it and look at the code more. [00:22:46] matt_flaschen: no problem [00:23:53] ok so should I go ahead with the deploy? Any cause for concern? [00:24:58] thcipriani, no. No regressions, RC also looks fine, including ORES features. 
[00:25:10] matt_flaschen: ok, going live [00:26:46] Thanks [00:26:50] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:339472|Store goodfaith scores in the ORES tables]] T137966 (duration: 00m 40s) [00:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:55] T137966: Include goodfaith model information in ORES review tool - https://phabricator.wikimedia.org/T137966 [00:26:58] ^ matt_flaschen should be live everywhere now [00:27:04] Great, let me test now. [00:30:23] Looks good [00:30:36] cool :) [00:55:36] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3051895 (10dpatrick) [00:58:34] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:00:34] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 268 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:05:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 14 probes of 268 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:17:44] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.001 second response time [01:18:04] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.001 second response time [01:26:34] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [02:26:59] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 07m 02s) [02:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Feb 24 02:32:21 UTC 2017 (duration 5m 22s) [02:32:25] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 705.12 seconds [03:29:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.25 seconds [06:51:56] (03CR) 10Giuseppe Lavagetto: [C: 031] Cumin: authorize also cumin masters IPv6 addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [06:58:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like this (I already argued we needed this as a fact in the past), a couple implementation questions." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [06:59:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Add support for 'not' in simple hosts selection [software/cumin] - 10https://gerrit.wikimedia.org/r/339213 (https://phabricator.wikimedia.org/T158748) (owner: 10Volans) [07:05:50] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3052321 (10Marostegui) I saw Chris email on IRC that the idrac firmware updated we performed earlier in the evening didn't go thru and he did another one. I can confirm that it is still not working after that last upgra... [07:11:24] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:22:15] (03PS1) 10Marostegui: db-codfw.php: Repool db2069 depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339602 (https://phabricator.wikimedia.org/T132416) [07:24:01] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2069 depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339602 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:25:30] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2069 depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339602 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:26:46] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2069 and depool db2070 - T132416 (duration: 00m 45s) [07:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:51] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:27:00] (03CR) 10jenkins-bot: db-codfw.php: Repool db2069 depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339602 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [07:27:44] (03PS1) 10Giuseppe Lavagetto: Add passwords::etcd [labs/private] - 10https://gerrit.wikimedia.org/r/339603 [07:28:44] (03PS2) 10Giuseppe Lavagetto: Add passwords::etcd [labs/private] - 10https://gerrit.wikimedia.org/r/339603 [07:32:26] !log Deploy alter table enwiki.revision on db2070 - T132416 [07:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:31] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [07:33:04] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:34:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add passwords::etcd [labs/private] - 10https://gerrit.wikimedia.org/r/339603 (owner: 10Giuseppe Lavagetto) [07:37:57] (03PS2) 10Giuseppe Lavagetto: conftool: remove base class, useless in the refactor [puppet] - 10https://gerrit.wikimedia.org/r/339459 [07:39:24] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [08:00:04] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [08:06:22] (03PS2) 10Jcrespo: mariadb/prometheus: remove workaround for precise [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:07:15] (03CR) 10Volans: [C: 032] Add support for 'not' in simple hosts selection [software/cumin] - 10https://gerrit.wikimedia.org/r/339213 (https://phabricator.wikimedia.org/T158748) (owner: 10Volans) [08:07:51] (03Merged) 10jenkins-bot: Add support for 'not' in simple hosts selection [software/cumin] - 10https://gerrit.wikimedia.org/r/339213 (https://phabricator.wikimedia.org/T158748) (owner: 10Volans) [08:08:17] (03PS2) 10Volans: Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) [08:08:26] (03PS2) 10Volans: Match the whole string for hosts regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/339177 (https://phabricator.wikimedia.org/T158746) [08:08:34] (03PS4) 10Volans: Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [08:09:59] (03CR) 10Jcrespo: [C: 031] "This can go now, although the new (reimaged) tools servers will not work until we fix the socket configuration. 
I am trying to do that at:" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:10:51] (03PS1) 10Muehlenhoff: Remove absense check from data_test.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339606 [08:11:45] (03PS5) 10Volans: Cumin: authorize also cumin masters IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) [08:11:50] (03CR) 10jerkins-bot: [V: 04-1] Remove absense check from data_test.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339606 (owner: 10Muehlenhoff) [08:16:34] (03PS2) 10Muehlenhoff: Remove absense check from data_test.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339606 [08:16:57] !log temporary disabled puppet on neodymium and sarin to deploy Gerrit 339183 - T158753 [08:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:04] T158753: Cumin: authorize also IPv6 on the targets - https://phabricator.wikimedia.org/T158753 [08:17:34] (03CR) 10Volans: [C: 032] Cumin: authorize also cumin masters IPv6 addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [08:18:33] madhuvishy: you still around? [08:19:40] there is an unmerged patch of yours [08:20:39] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3052389 (10elukey) After a long investigation (and lot of changes!) we are almost ready to switch the first shard (mc1001 -> mc1019) following this procedure:... [08:20:46] volans: --^ :) [08:20:54] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:22:44] elukey: looking in a second, need to figure out the puppet unmerged patch [08:23:14] volans: sure sure nothing to check, it is what we discussed yesterday about cumin and mc1001->mc1019 swap [08:23:24] great [08:23:48] madhuvishy: FYI I'm puppet-merging https://gerrit.wikimedia.org/r/#/c/334218 given it's already merged [08:26:24] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:26:24] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:26:34] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:27:12] this is me ^^^, or better, puppet race condition :( [08:28:24] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:41:25] (03CR) 10Filippo Giunchedi: [C: 031] "No harm in merging I suppose" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:44:48] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1001.eqiad.wmnet [08:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:03] wdqs1001 is in trouble, I'm having a look... 
[08:48:54] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:51:58] volans: oops yeah thanks [08:52:05] (03CR) 10Elukey: [C: 031] Match the whole string for hosts regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/339177 (https://phabricator.wikimedia.org/T158746) (owner: 10Volans) [08:52:55] madhuvishy: no problem :) [08:53:55] wdqs1001 was put in maintenance by Stas, I'll wait for him before re-enabling it [08:54:24] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [08:55:34] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:02:35] (03CR) 10Volans: [C: 032] Match the whole string for hosts regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/339177 (https://phabricator.wikimedia.org/T158746) (owner: 10Volans) [09:03:13] (03Merged) 10jenkins-bot: Match the whole string for hosts regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/339177 (https://phabricator.wikimedia.org/T158746) (owner: 10Volans) [09:05:37] (03CR) 10Elukey: [C: 031] Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) (owner: 10Volans) [09:06:52] (03PS3) 10Volans: Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) [09:08:07] (03CR) 10Volans: [C: 032] Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) (owner: 10Volans) [09:15:27] (03PS2) 10Volans: Cumin: disable urllib3 SubjectAltNameWarning [puppet] - 10https://gerrit.wikimedia.org/r/339180 (https://phabricator.wikimedia.org/T158758) [09:18:06] (03CR) 10Volans: [C: 032] Cumin: disable urllib3 
SubjectAltNameWarning [puppet] - 10https://gerrit.wikimedia.org/r/339180 (https://phabricator.wikimedia.org/T158758) (owner: 10Volans) [09:20:29] (03Merged) 10jenkins-bot: Allow to ignore selected urllib3 warnings [software/cumin] - 10https://gerrit.wikimedia.org/r/339179 (https://phabricator.wikimedia.org/T158758) (owner: 10Volans) [09:27:40] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2365691 (10fgiunchedi) Thanks @elukey ! While we're at it with (de)commissioning hosts from nutcracker there's also a config change to listen for stats on loca... [09:32:24] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 79 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [09:32:25] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3052523 (10elukey) [09:33:54] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:35:12] (03PS1) 10Marostegui: db-eqiad.php: Repool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339609 (https://phabricator.wikimedia.org/T154485) [09:37:24] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [09:39:50] !log stop Redis and Memcached on mc2001->mc2016 as extra precautionary step before decom - T157675 [09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:55] T157675: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675 [09:44:56] (03PS3) 10Urbanecm: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) [09:45:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339609 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [09:46:19] (03CR) 10Urbanecm: "Thank you for your notification! Didn't notice that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [09:47:12] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339609 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [09:47:22] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339609 (https://phabricator.wikimedia.org/T154485) (owner: 10Marostegui) [09:47:50] (03PS4) 10Urbanecm: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) [09:48:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1036 - T154485 (duration: 00m 40s) [09:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:19] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [09:48:43] (03PS1) 10Jcrespo: prometheus: Make server-board a featured dashboard [puppet] - 10https://gerrit.wikimedia.org/r/339610 [09:49:17] (03CR) 10Marostegui: [C: 031] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [09:57:25] (03PS1) 10Elukey: Remove last settings for mc2001->mc2016 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/339611 (https://phabricator.wikimedia.org/T157675) [09:58:08] 06Operations: Graphite-web version in our repo cannot be installed due to missing dependencies - https://phabricator.wikimedia.org/T158802#3052568 (10fgiunchedi) 05Open>03Resolved Indeed @Andrew I did backport graphite 0.9.15 but forgot python-django-tagging, now done! graphite-web should be installable/upgr... 
[10:00:38] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: Make server-board a featured dashboard [puppet] - 10https://gerrit.wikimedia.org/r/339610 (owner: 10Jcrespo) [10:00:54] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:02:31] (03PS1) 10Jcrespo: prometheus: Rename the server boad file to a .json extension [puppet] - 10https://gerrit.wikimedia.org/r/339613 [10:02:43] (03CR) 10Jcrespo: [C: 032] prometheus: Make server-board a featured dashboard [puppet] - 10https://gerrit.wikimedia.org/r/339610 (owner: 10Jcrespo) [10:02:45] (03PS1) 10Jcrespo: mariadb: Repool db1026 after maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) [10:03:31] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3052577 (10elukey) Hello @Papaul, @RobH and @Cmjohnson! While reviewing https://wikitech.wikimedia.org/wiki/Server_Lifecycle... [10:04:39] (03CR) 10Jcrespo: "I can do the others too, if you prefer it." [puppet] - 10https://gerrit.wikimedia.org/r/339613 (owner: 10Jcrespo) [10:23:35] !log installing spice updates on trusty [10:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:18] (03PS2) 10Giuseppe Lavagetto: role::conftool::master: remove as it is unused [puppet] - 10https://gerrit.wikimedia.org/r/339458 [10:28:02] (03CR) 10Giuseppe Lavagetto: [C: 032] role::conftool::master: remove as it is unused [puppet] - 10https://gerrit.wikimedia.org/r/339458 (owner: 10Giuseppe Lavagetto) [10:30:29] !log installing imagemagick regression update for security update on trusty (the Debian update seems unaffected) [10:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:59] (03CR) 10Jcrespo: [C: 04-1] "Wait for the slave to catch up." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [10:31:54] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:37] (03PS5) 10Ema: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) [10:34:46] (03CR) 10Ema: [V: 032 C: 032] cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema) [10:35:23] (03CR) 10Filippo Giunchedi: [C: 04-1] "grafana::dashboard already adds .json as needed" [puppet] - 10https://gerrit.wikimedia.org/r/339613 (owner: 10Jcrespo) [10:35:57] jynus: ^ if not having .json is not intuitive we can change the module to not add the extension and rename [10:36:30] I do not really have a thought on that [10:37:18] as in [10:37:46] I thought maybe having the .json on puppet would be better for linting purposes [10:38:17] but I do not care too much, I am ok with you deciding on it in the future [10:40:00] (03Abandoned) 10Jcrespo: prometheus: Rename the server boad file to a .json extension [puppet] - 10https://gerrit.wikimedia.org/r/339613 (owner: 10Jcrespo) [10:40:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "PCC says only expected changes will happen" [puppet] - 10https://gerrit.wikimedia.org/r/339459 (owner: 10Giuseppe Lavagetto) [10:41:06] (03PS3) 10Giuseppe Lavagetto: conftool: remove base class, useless in the refactor [puppet] - 10https://gerrit.wikimedia.org/r/339459 [10:41:49] !log cache_misc: upgrading to varnish 4.1.5 [10:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [10:49:34] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:59:54] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:03:14] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:15:53] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3052693 (10elukey) >>! In T154558#3043971, @JoeWalsh wrote: > @Milimetric this UA is from the iOS app. In testing locally, I didn't see... [11:26:12] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3052716 (10Gehel) The issue seems to be with: ``` RewriteRule ^/(upload|wiki|stats|w)/(.*)$ %{ENV:RW_PROTO}://en.... [11:30:14] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:31:08] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3052728 (10elukey) More numbers about number of requests landing to piwik/apache/bohrium and failed ones (503s). The following numbers... [11:32:58] !seen MZMcBride [11:43:03] (03CR) 10Jcrespo: "We can add all already, and add them manually if we have stopped replication and restarted a few servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [11:51:42] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#2962020 (10Nemo_bis) > a range of issues It would be useful to document what aspects were considered beyond network, legal and cost. For instance, was environmental impact considered (cf. http://www.greenpea... [12:01:36] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052793 (10tomasz) Given that most commercial data centers in Singapore seem to based on 0% renewable energy, I would really like to know whether environmental concerns were taken into consideration when this... [12:11:54] (03CR) 10Faidon Liambotis: Add an interface_primary fact (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [12:13:11] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052807 (10Gnom1) Hi, where can we have a good discussion about the need to choose a datacenter that runs on renewable energy? I suppose that this bug is not the ideal location. Thanks for any pointers! [12:22:52] (03CR) 10Giuseppe Lavagetto: [C: 031] Add an interface_primary fact (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [12:26:07] (03CR) 10Giuseppe Lavagetto: [C: 031] Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659) (owner: 10Volans) [12:28:54] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:29:44] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [12:30:18] (03CR) 10Giuseppe Lavagetto: [C: 031] Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [12:35:44] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:40:51] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: use passwords::etcd [puppet] - 10https://gerrit.wikimedia.org/r/339625 [13:04:44] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:06:05] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052899 (10BBlack) This probably isn't the ideal location, but I can speak to the issue here since it's obvious that some will come looking here for that answer. The TL;DR is that environmental consideration... [13:10:45] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:17] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052915 (10Gnom1) Thank you for this information, Brandon. While your points are understandable, this does not mean that we should not try to find a vendor that uses renewable energy for their servers. So aga... [13:19:44] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:25:44] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:28:38] 06Operations, 10ops-codfw, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3052924 (10elukey) [13:33:07] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052927 (10BBlack) I think you can discuss that anywhere you like (within reason!). To clarify re: your language above: we deploy our own server hardware as opposed to using virtual hosting, so the environme... [13:33:34] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 62.71% of data above the critical threshold [3686.4] [13:35:34] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3276.8] [13:39:44] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:40:34] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 62.71% of data above the critical threshold [3686.4] [13:42:04] PROBLEM - puppet last run on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:50] (03CR) 10Marostegui: "Hi! All = all the shards you mean or all the servers in s2?" 
[puppet] - 10https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [13:44:04] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [13:44:05] (03PS2) 10Ema: WIP: prometheus: add node tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/339465 [13:46:12] (03CR) 10Jcrespo: "Every where unconditionally- note I refer here to puppet and the config on file, we can do it as slow as you want it for the masters/other" [puppet] - 10https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [13:48:02] brand new alarms for HDFS namenode, going to check [13:57:45] (03PS3) 10Volans: Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) [13:58:54] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:24] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3052998 (10Gnom1) Thank you for your reply, Brandon. Maybe I should clarify my question: Where can //Wikipedians// have a discussion //with you and your team// about //running Wikipedia's servers on renewable... 
[14:04:34] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3276.8] [14:07:55] (03CR) 10Volans: [C: 032] Keyholder: fix filter of passwordless keys [puppet] - 10https://gerrit.wikimedia.org/r/338984 (https://phabricator.wikimedia.org/T158660) (owner: 10Volans) [14:08:17] (03PS2) 10Volans: Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659) [14:09:54] (03CR) 10Volans: [C: 032] Keyholder: add support for ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/339002 (https://phabricator.wikimedia.org/T158659) (owner: 10Volans) [14:12:10] _joe_: there is an unmerged patch of yours (https://gerrit.wikimedia.org/r/#/c/339459 ) can I proceed merging? [14:12:40] <_joe_> volans: yes, go on [14:12:44] <_joe_> volans: meh [14:13:07] lol [14:13:34] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:13:51] _joe_: merged [14:14:04] <_joe_> volans: ack thx [14:14:11] yw [14:14:41] eventual keyholder alerts are mine, re-arming them [14:16:04] PROBLEM - puppet last run on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:04] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures [14:17:54] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [14:19:54] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. 
[14:21:38] 06Operations, 10MediaWiki-Configuration: Request to lift registration limit on IP addresses for event [URGENT] - https://phabricator.wikimedia.org/T158963#3053040 (10Zppix) [14:24:48] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3053045 (10BBlack) I really don't mean to be overly facile here, but if you're interested in having a discussion, we can have that at any usual public discussion venue. The wikitech mailing list might be a g... [14:27:36] !log re-started and re-armed keyholder after upgrade on: mira.codfw.wmnet,neodymium.eqiad.wmnet,sarin.codfw.wmnet,tin.eqiad.wmnet T158660 T158659 [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] T158660: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660 [14:27:43] T158659: Keyholder: add support for ED25519 keys - https://phabricator.wikimedia.org/T158659 [14:27:54] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:28:18] 06Operations, 06Operations-Software-Development, 13Patch-For-Review: Keyholder accept passwordless keys - https://phabricator.wikimedia.org/T158660#3053053 (10Volans) 05Open>03Resolved [14:29:36] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3053061 (10Milimetric) It seems to me you can close this task and open up a new one to investigate Varnish / Apache problems (as those a... 
[14:39:25] (03CR) 10Hoo man: [C: 031] Disallow geo-shape data type on wikidata for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339446 (https://phabricator.wikimedia.org/T158849) (owner: 10Aude) [14:41:34] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:44:04] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:44:40] (03PS1) 10Volans: NRPE: fix description for check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/339636 [14:46:55] apergos: Have you seen this? https://phab.wmfusercontent.org/file/data/dqidq3xm2vys7ofc3hsz/PHID-FILE-hbdmho7epcvfz6osajfb/pasted_file [14:48:22] Amir1: what's the issue? [14:48:23] (03CR) 10Volans: [C: 032] NRPE: fix description for check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/339636 (owner: 10Volans) [14:48:44] apergos: This is my proposal of the new design for progress.html [14:48:44] I've been watching the dumps (including wikidatawiki) every day and they seem to be doing fine [14:48:48] ahh [14:48:50] sorry [14:48:59] I thought you were asking something about the run status :-D [14:49:09] yes, I have seen it, though I have not yet tested it [14:49:25] Awesome. No rush. [14:49:49] glad to see that change in, that's the last of the css changes isn't it? [14:50:38] yes, that will be the last [14:55:22] (03PS1) 10Volans: Add codecov and codacy config and badges [software/cumin] - 10https://gerrit.wikimedia.org/r/339637 (https://phabricator.wikimedia.org/T154588) [14:55:54] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:00:25] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3053131 (10Gnom1) Oh, I've already tried [[ https://lists.wikimedia.org/pipermail/wikitech-l/2016-March/085128.html | writing to wikitech-l ]], which did not lead anywhere. I also asked to be added to ops-l,... [15:04:44] RECOVERY - DPKG on labmon1001 is OK: All packages OK [15:05:14] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:06:37] (03CR) 10Volans: [C: 032] Add codecov and codacy config and badges [software/cumin] - 10https://gerrit.wikimedia.org/r/339637 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:07:20] (03Merged) 10jenkins-bot: Add codecov and codacy config and badges [software/cumin] - 10https://gerrit.wikimedia.org/r/339637 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:10:14] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:04] RECOVERY - puppet last run on db1088 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:15:04] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 180 seconds ago with 0 failures [15:16:03] (03PS2) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: use passwords::etcd [puppet] - 10https://gerrit.wikimedia.org/r/339625 [15:16:13] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::tlsproxy: use passwords::etcd [puppet] - 10https://gerrit.wikimedia.org/r/339625 (owner: 10Giuseppe Lavagetto) [15:18:42] (03PS1) 10Volans: Add codacy badge [software/cumin] - 10https://gerrit.wikimedia.org/r/339639 [15:19:47] (03CR) 10Volans: [C: 032] Add codacy badge [software/cumin] - 10https://gerrit.wikimedia.org/r/339639 (owner: 10Volans) [15:20:36] (03Merged) 10jenkins-bot: Add codacy badge [software/cumin] - 10https://gerrit.wikimedia.org/r/339639 (owner: 10Volans) [15:21:32] (03PS1) 10Elukey: Tune JVM 
Heap size alarms for Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/339640 [15:21:54] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:22:34] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:22:59] (03CR) 10Elukey: [C: 032] Tune JVM Heap size alarms for Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/339640 (owner: 10Elukey) [15:23:44] (03PS2) 10Giuseppe Lavagetto: role::pybaltest: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339417 [15:23:54] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:26:09] (03CR) 10Giuseppe Lavagetto: [C: 032] role::pybaltest: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339417 (owner: 10Giuseppe Lavagetto) [15:27:10] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3053150 (10BBlack) I think the wikitech discussion seems like it was, in fact, a good discussion of the issue. So if your goal is discussion, I don't see the issue here. The metawiki page does contain a lot... [15:28:54] (03PS1) 10Giuseppe Lavagetto: role::pybaltest: allow using experimental packages [puppet] - 10https://gerrit.wikimedia.org/r/339641 [15:30:51] (03CR) 10Giuseppe Lavagetto: [C: 032] role::pybaltest: allow using experimental packages [puppet] - 10https://gerrit.wikimedia.org/r/339641 (owner: 10Giuseppe Lavagetto) [15:32:58] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:33:03] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: use passwords::etcd [puppet] - 10https://gerrit.wikimedia.org/r/339643 [15:33:31] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::replication: use passwords::etcd [puppet] - 10https://gerrit.wikimedia.org/r/339643 (owner: 10Giuseppe Lavagetto) [15:35:01] !log temporarily bumping timeout_idle to 60s on cache_misc T154558 [15:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:10] T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558 [15:37:25] (03PS2) 10BBlack: normalize host header a little better [puppet] - 10https://gerrit.wikimedia.org/r/325855 [15:37:38] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:17] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: also remove the hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/339644 [15:39:08] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::replication: also remove the hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/339644 (owner: 10Giuseppe Lavagetto) [15:39:36] (03Abandoned) 10BBlack: normalize host header a little better [puppet] - 10https://gerrit.wikimedia.org/r/325855 (owner: 10BBlack) [15:41:58] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:27] <_joe_> I hate our puppetmaster puppet code [15:42:37] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3053179 (10Gnom1) The goal is to //have Wikipedia's servers run on renewable energy//. It's as simple as that. In Europe, this is a no-brainer, while I understand that it is not so much in the U.S. But Google... 
[15:43:46] (03PS1) 10Volans: Fixing minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339647 (https://phabricator.wikimedia.org/T158967) [15:44:01] (03PS1) 10BBlack: VCL: do not allow empty url when un-proxying [puppet] - 10https://gerrit.wikimedia.org/r/339648 [15:44:34] 06Operations: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798#3053185 (10MoritzMuehlenhoff) The dbproxy hosts don't use base::firewall yet, we talked about that in the past, but haven't scheduled a date yet. But that's hardly limited to the ferm classes,... [15:47:58] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:49:42] (03CR) 10Faidon Liambotis: [C: 032] Fixing minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339647 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [15:49:58] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:50:35] (03Merged) 10jenkins-bot: Fixing minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339647 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [15:51:20] (03CR) 10Faidon Liambotis: [C: 032] Reorder check for timesyncd or ntpd [puppet] - 10https://gerrit.wikimedia.org/r/338364 (owner: 10Muehlenhoff) [15:51:38] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:52:57] (03CR) 10Faidon Liambotis: [C: 04-1] "Need to add the reverse DNS too :)" [dns] - 10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [15:54:08] (03CR) 10Faidon Liambotis: [C: 031] Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:54:49] 
(03PS2) 10Faidon Liambotis: Add an interface_primary fact [puppet] - 10https://gerrit.wikimedia.org/r/339231 [15:55:03] (03CR) 10Faidon Liambotis: [C: 032] Add an interface_primary fact [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [15:56:28] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:55] 06Operations, 10MediaWiki-extensions-Scribunto, 07HHVM: luasandbox profiler doesn't sort results due to an HHVM bug - https://phabricator.wikimedia.org/T158029#3053190 (10Anomie) p:05Triage>03Low The bug should be fixed now in the master branch of the mediawiki/php/luasandbox repo. As far as I know, to g... [15:57:48] (03PS1) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [16:01:30] (03PS1) 10Faidon Liambotis: Kill ubuntu.wikimedia.org legacy hostname [puppet] - 10https://gerrit.wikimedia.org/r/339652 [16:02:38] (03PS1) 10Faidon Liambotis: Kill ubuntu.wikimedia.org legacy hostname [dns] - 10https://gerrit.wikimedia.org/r/339653 [16:04:20] (03PS2) 10Volans: Add reverse mapped IPv6 for neodymium and sarin [dns] - 10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) [16:06:20] (03PS3) 10Volans: Add reverse mapped IPv6 for neodymium and sarin [dns] - 10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) [16:13:08] 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#3053233 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:13:15] (03CR) 10Faidon Liambotis: [C: 032] Add reverse mapped IPv6 for neodymium and sarin [dns] - 10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [16:13:59] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3053235 (10BBlack) >>! 
In T156029#3053179, @Gnom1 wrote: > The goal is to //have Wikipedia's servers run on renewable energy//. It's as simple as that. I don't think that's a realistic goal anytime soon. >... [16:14:27] PROBLEM - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:14:28] ACKNOWLEDGEMENT - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T158969 [16:14:32] 06Operations, 10ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3053239 (10ops-monitoring-bot) [16:14:37] (03PS1) 10Muehlenhoff: Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) [16:17:08] marostegui: did you break another one? [16:17:12] :-P [16:17:31] (03CR) 10jerkins-bot: [V: 04-1] Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [16:20:08] (03PS1) 10Gehel: portals: do not rewrite 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) [16:20:53] it could be the temperature [16:21:08] do we have trends on that? [16:21:37] I didn't!! [16:21:37] :) [16:22:03] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3053253 (10Marostegui) [16:22:33] (03PS2) 10Muehlenhoff: Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) [16:23:07] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3053239 (10Marostegui) Hey @Cmjohnson it should be safe to change this disk when you have time Thanks! 
[16:23:51] (03CR) 10Jcrespo: [C: 031] mariadb: Repool db1026 after maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [16:24:31] (03CR) 10Rush: [C: 031] "sounds good, this is more of a lint check than any user state validation fyi. Preventing the breakage where a user is absented but still " [puppet] - 10https://gerrit.wikimedia.org/r/339606 (owner: 10Muehlenhoff) [16:24:56] (03PS2) 10Jcrespo: mariadb: Repool db1026 after maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) [16:25:02] (03CR) 10jerkins-bot: [V: 04-1] Add consistency check for nda and wmf LDAP groups based on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/339656 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [16:25:08] I am going to pool db1026 [16:25:11] it is late for me [16:25:23] but I think it is better than going to the weekend with poor redundancy [16:26:40] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3053279 (10Papaul) @elukey I get take over from here. Thanks [16:27:10] (03CR) 10Marostegui: [C: 031] mariadb: Repool db1026 after maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [16:27:58] (03PS1) 10Giuseppe Lavagetto: role::pybaltest: fix compilation [puppet] - 10https://gerrit.wikimedia.org/r/339660 [16:29:48] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3053285 (10chasemp) Note: to do the DRBD direct link they may need to be close, even adjoining racks. I'm not sure. Kind of a cal... 
[16:33:58] :q [16:44:52] 06Operations: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798#3053303 (10Volans) @jcrespo @Marostegui are you ok with the manual removal of any ferm rule and manual restore of a clean iptables table on `dbproxy1011`? [16:47:37] 06Operations: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798#3053308 (10jcrespo) Sure. Even if it breaks something, it is considered beta. You can give it a check and compare it to dbproxy1010, it should have the same rules. [16:47:57] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:50:49] (03CR) 10Giuseppe Lavagetto: [C: 032] role::pybaltest: fix compilation [puppet] - 10https://gerrit.wikimedia.org/r/339660 (owner: 10Giuseppe Lavagetto) [16:52:58] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:54:27] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:54:38] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3053327 (10RobH) a:05elukey>03Papaul [16:55:15] !log manually cleaning ferm leftovers on dbproxy1011 - T158798 [16:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:20] T158798: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798 [16:59:43] (03PS3) 10Ema: prometheus: add node tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/339465 [17:05:37] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:13:50] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1026 after 
maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [17:14:24] (03CR) 10jenkins-bot: mariadb: Repool db1026 after maintenance with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339614 (https://phabricator.wikimedia.org/T147747) (owner: 10Jcrespo) [17:15:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 after maintenance with low load (duration: 00m 40s) [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:57] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:20:11] 06Operations: Ferm: leftovers on hosts were it was enabled and then removed - https://phabricator.wikimedia.org/T158798#3053396 (10Volans) 05Open>03Resolved a:03Volans Cleanup completed and all looks good so far. Resolving [17:21:10] (03PS1) 10BBlack: varnish: move "apps" data back into manifests [WIP, 1/4] [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [17:21:12] (03PS1) 10BBlack: varnish: switch all clusters to req_handling [WIP, 2/4] [puppet] - 10https://gerrit.wikimedia.org/r/339668 (https://phabricator.wikimedia.org/T134404) [17:21:14] (03PS1) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 [17:22:56] (03CR) 10jerkins-bot: [V: 04-1] varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (owner: 10BBlack) [17:27:10] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3053411 (10Joe) [17:27:12] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, and 2 others: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#3053410 (10Joe) 05Open>03Resolved 
[17:28:07] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, and 6 others: Expand conftool to support multiple objects via a schema definition. - https://phabricator.wikimedia.org/T155823#3053413 (10Joe) The code is done and a package has been created, although still only i... [17:28:26] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, and 6 others: Expand conftool to support multiple objects via a schema definition. - https://phabricator.wikimedia.org/T155823#3053414 (10Joe) 05Open>03Resolved [17:28:30] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3053415 (10Joe) [17:30:18] (03PS1) 10Gehel: maps: make version of nodejs configureable [puppet] - 10https://gerrit.wikimedia.org/r/339670 [17:31:43] (03PS2) 10BBlack: varnish: switch all clusters to req_handling [WIP, 2/4] [puppet] - 10https://gerrit.wikimedia.org/r/339668 (https://phabricator.wikimedia.org/T134404) [17:31:45] (03PS2) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 [17:32:40] (03PS2) 10Gehel: maps: make version of nodejs configurable [puppet] - 10https://gerrit.wikimedia.org/r/339670 [17:32:51] (03CR) 10jerkins-bot: [V: 04-1] varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (owner: 10BBlack) [17:35:22] (03CR) 10Volans: [C: 04-1] "Missing piece? 
see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339670 (owner: 10Gehel) [17:36:42] (03PS3) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (https://phabricator.wikimedia.org/T134404) [17:36:45] (03PS1) 10BBlack: varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) [17:37:13] (03PS3) 10Gehel: maps: make version of nodejs configurable [puppet] - 10https://gerrit.wikimedia.org/r/339670 [17:37:50] (03CR) 10jerkins-bot: [V: 04-1] varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [17:37:57] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:37:58] (03CR) 10jerkins-bot: [V: 04-1] varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [17:38:07] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3053422 (10faidon) a:05BBlack>03Cmjohnson [17:38:16] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3053423 (10faidon) a:03Cmjohnson [17:41:08] (03PS1) 10Giuseppe Lavagetto: conftool-data: convert to new format [puppet] - 10https://gerrit.wikimedia.org/r/339672 [17:41:10] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: add default schema [puppet] - 10https://gerrit.wikimedia.org/r/339673 (https://phabricator.wikimedia.org/T149617) [17:41:12] (03PS1) 10Giuseppe Lavagetto: conftool-data: add first discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/339674 (https://phabricator.wikimedia.org/T149617) [17:41:48] <_joe_> bblack: ^^ I'm not going to work on this anymore today, but on monday we 
might be able to do some testing w gdnsd and this format [17:45:09] _joe_: ok, looks pretty cool [17:50:41] (03PS1) 10Muehlenhoff: Extend access for flemmerich and psinger [puppet] - 10https://gerrit.wikimedia.org/r/339676 [17:54:12] (03CR) 10Muehlenhoff: [C: 032] Extend access for flemmerich and psinger [puppet] - 10https://gerrit.wikimedia.org/r/339676 (owner: 10Muehlenhoff) [17:54:20] (03PS2) 10Muehlenhoff: Extend access for flemmerich and psinger [puppet] - 10https://gerrit.wikimedia.org/r/339676 [17:58:31] (03PS1) 10Muehlenhoff: Add two subsequently assigned CVE IDs [debs/linux44] - 10https://gerrit.wikimedia.org/r/339677 [18:04:40] (03CR) 10Muehlenhoff: [C: 032] Add two subsequently assigned CVE IDs [debs/linux44] - 10https://gerrit.wikimedia.org/r/339677 (owner: 10Muehlenhoff) [18:07:07] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053488 (10shrlak) [18:07:57] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:09:48] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:10:40] (03PS4) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [18:17:44] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3053517 (10Cmjohnson) @faidon, I will swap out the sfp+ ...that is the most typical culprit. Do we need to schedule downtime? or can I do anytime? [18:19:27] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:27:40] 06Operations, 10hardware-requests: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3053552 (10Dzahn) < mutante> paravoid: i'm back and will try the bast3001 reinstall today. first thing was "should it be install3002 or actually re-use the existing name". i guess keep it 3001 because the inconsist... [18:33:12] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053488 (10RobH) Please note we need to have a few things for this request to be fully vetted: [x] - wiki... [18:33:35] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053556 (10RobH) [18:36:47] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:38:04] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3053568 (10faidon) I believe @ema has depooled it, so any time should be OK. [18:39:00] 06Operations, 06DC-Ops, 06Discovery, 03Interactive-Sprint: Decide what to do with maps-test cluster - https://phabricator.wikimedia.org/T158982#3053572 (10Deskana) [18:41:30] 06Operations, 06DC-Ops, 06Discovery, 03Interactive-Sprint: Decide what to do with maps-test cluster - https://phabricator.wikimedia.org/T158982#3053576 (10Deskana) a:03Gehel @gehel is point person as ops engineer for the team, so assigning to him. We will figure this out together. 
[18:41:33] 06Operations, 06DC-Ops, 06Discovery, 03Interactive-Sprint: Decide what to do with maps-test cluster - https://phabricator.wikimedia.org/T158982#3053578 (10Deskana) p:05Triage>03Normal [18:42:25] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053581 (10RobH) [18:42:37] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:47:27] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:50:38] (03CR) 10MarcoAurelio: "We should probably also check if any of this short aliases do conflict with any interwiki links. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [18:51:24] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053618 (10RobH) a:05Jgreen>03None [18:57:17] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:58:15] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053648 (10RStallman-legalteam) @RobH: this is to confirm that @skrlak is under contract with us through C... 
[18:58:19] 06Operations: content removed from etherpad - https://phabricator.wikimedia.org/T158987#3053649 (10ggellerman) [18:59:11] 06Operations: content removed from etherpad - https://phabricator.wikimedia.org/T158987#3053661 (10chasemp) p:05Triage>03Normal [19:03:12] 06Operations: content removed from etherpad - https://phabricator.wikimedia.org/T158987#3053649 (10greg) Simple vandalism. I found the last revision that looks like you all at https://etherpad.wikimedia.org/p/Ronaldo/timeslider#4531 [19:04:09] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053675 (10RobH) [19:04:47] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:05] 06Operations, 10ops-eqiad, 10Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3053683 (10ema) @Cmjohnson yes the system is indeed depooled. Please go ahead whenever it is convenient for you. Thanks! [19:11:37] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:22:53] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053728 (10RobH) [19:23:34] Pchelolo: any particular servers i should expect restbase logging to come from? And is there any way i can force a server to send any logging? I've started tcpdump'ing incoming packets to the logstash machines from restbase1015 but i only see cassandra logs [19:25:18] ebernhardson: no particular server, no.. 
[19:26:17] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:26:51] Pchelolo: looking at history individual servers sometimes go 20+ minutes without logging anything, so it makes it a bit hard to debug. Could we turn up the logging levels or something while debugging? [19:27:26] ebernhardson: we have a staging cluster for restbase, let me see if there's the same problem there [19:27:54] (03PS1) 10Dzahn: (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) [19:28:14] (03CR) 10jerkins-bot: [V: 04-1] (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) (owner: 10Dzahn) [19:28:30] yup, it seems there is. Here's the dashboard: https://logstash.wikimedia.org/goto/e21fea6138c5edba7af86b6488f1f8a8 [19:28:46] the logs should be coming from xenon.eqiad.wmnet [19:29:35] (03PS2) 10Dzahn: (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) [19:29:44] Tell me when you're ready, I'll restart RESTBase on that node which would force some logs [19:29:50] (03CR) 10jerkins-bot: [V: 04-1] (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) (owner: 10Dzahn) [19:29:58] Pchelolo: i've got adump going on the three machines now [19:30:38] !log restarting RESTBase on xenon.eqiad.wmnet in staging [19:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] (03PS3) 10Dzahn: (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) [19:31:49] ebernhardson: the logs have to go to logstash1002.eqiad.wmnet [19:32:15] Pchelolo: hmm, a couple packets are coming into 1002, but they aren't plain text (which is what i saw dumping other random sources) [19:32:57] RECOVERY - 
puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:33:12] and we're logging via gelf over udp btw [19:33:40] hmm, the latest gelf uses compression i wonder if logstash we are using supports that [19:35:12] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053737 (10RobH) [19:36:30] ebernhardson: we're using exactly the same logging libs and logic for all services. Other services are ok, restbase is not.. [19:37:03] and it started exactly at 12utc on 02-18. Nothing was deployed at the time [19:37:49] (03CR) 10RobH: [C: 031] (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) (owner: 10Dzahn) [19:38:52] (03CR) 10Florianschmidtwelzow: "dependeing change has been merged, so this could be merged after the next train deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133) (owner: 10Gergő Tisza) [19:39:13] Pchelolo: but nothing has been deployed to logstash for months either, so something changed just have to find what ... [19:39:38] (03PS4) 10Dzahn: (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) [19:39:41] (03PS1) 10Dzahn: add bast3002 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/339684 (https://phabricator.wikimedia.org/T131560) [19:39:59] is logstash recording anything on that node ebernhardson? 
Sometimes it just gives up [19:40:29] checking the local elasticsearch logs is usually a good way to tell if logstash is totally busted [19:40:51] (03CR) 10Dzahn: [C: 032] (re-)add hooft as bast3002 [dns] - 10https://gerrit.wikimedia.org/r/339681 (https://phabricator.wikimedia.org/T131560) (owner: 10Dzahn) [19:41:35] bd808: nothing for logstash or elasticsearch logs that suggest problems. afaict [19:43:47] 06Operations, 10hardware-requests: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3053743 (10Dzahn) Change 339681 merged by Dzahn: (re-)add hooft as bast3002 https://gerrit.wikimedia.org/r/339681 [19:44:37] (03PS2) 10Dzahn: add bast3002 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/339684 (https://phabricator.wikimedia.org/T156506) [19:47:37] (03CR) 10Dzahn: "@Aklapper Alrighty, so in this (special) case you or me will have to amend to this change (or abandon it). Feel like doing that?" [puppet] - 10https://gerrit.wikimedia.org/r/317990 (owner: 10Alex Monk) [19:48:35] (03Abandoned) 10Paladox: Phabricator: Up the size for storage.mysql-engine.max-size to 20mb in bytes [puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: 10Paladox) [19:49:40] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053750 (10RStallman-legalteam) @RobH: right now contract is through 6/30/2017. Thanks! [19:55:27] (03PS1) 10BryanDavis: wikitech: Add a use statement for LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339685 [19:55:41] andrewbogott: ^ thats what I changed [19:56:22] (03CR) 10Andrew Bogott: [C: 031] "this seems better!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339685 (owner: 10BryanDavis) [19:56:25] bd808: thanks [19:57:02] RainbowSprinkles: ^^ wanna do a wikitech only config merge & sync? 
[19:57:56] * RainbowSprinkles looks [19:58:00] * RainbowSprinkles munches on his lunch [19:58:24] (03CR) 10Chad: [C: 032] wikitech: Add a use statement for LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339685 (owner: 10BryanDavis) [19:59:34] (03Merged) 10jenkins-bot: wikitech: Add a use statement for LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339685 (owner: 10BryanDavis) [20:01:48] (03CR) 10jenkins-bot: wikitech: Add a use statement for LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339685 (owner: 10BryanDavis) [20:02:38] thanks RainbowSprinkles [20:02:55] thanks [20:03:05] (03PS1) 10RobH: shell access for Shreyas Lakhtakia [puppet] - 10https://gerrit.wikimedia.org/r/339686 [20:03:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053762 (10RobH) I forgot to note that typically we list someone to be notified when... [20:04:09] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053764 (10RobH) [20:04:37] andrewbogott: I re-ran scap pull and also cleaned up the old branch crud that it warned about [20:04:38] bblack: ema: hi! Very specific question here: is there a way for Multicast HTCP purging to purge using a regex instead of an exact URL? 
https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging [20:05:13] !log demon@tin Synchronized wmf-config/wikitech.php: (no justification provided) (duration: 00m 48s) [20:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:28] No justification provided, tsk tsk [20:05:32] * RainbowSprinkles self-shames [20:09:57] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:14:50] (03PS1) 10Dzahn: dhcp/site: add bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339687 (https://phabricator.wikimedia.org/T156506) [20:17:30] (03PS2) 10Dzahn: dhcp/site: add bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339687 (https://phabricator.wikimedia.org/T156506) [20:19:52] (03PS3) 10Dzahn: dhcp/site: add bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339687 (https://phabricator.wikimedia.org/T156506) [20:21:59] (03CR) 10Dzahn: [C: 032] dhcp/site: add bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339687 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:24:04] (03PS2) 10Gergő Tisza: Use custom LogstashFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133) [20:25:00] (03CR) 10Gergő Tisza: "It does not necessarily have to wait for the train, the formatter file already exists." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133) (owner: 10Gergő Tisza) [20:37:57] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:39:28] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2828442 (10dpatrick) I've reviewed both content and technical implementation of the 2016 Annual Report and found no major security problems. Here are a few... 
[20:45:43] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3053861 (10ZMcCune) Thank you so much @dpatrick! - Will check spelling on financials.html - Legal reviewed the video and said all content was OK... [20:50:47] !log restart elasticsearch on logstash1002 [20:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:49] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053888 (10Jgreen) I verified Shreyas's SSH public key by Google (video) Hangout today. [20:58:37] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3053891 (10Jgreen) [21:01:26] 06Operations, 15User-greg: content removed from etherpad - https://phabricator.wikimedia.org/T158987#3053893 (10greg) 05Open>03Resolved a:03greg I deleted the vandal content, there's a backup of the useful content in a gdoc. I'll let the team decide where they want to work in the future (same pad or new... [21:05:19] ebernhardson: Em... Did you do something? The logs are back! [21:05:29] :) [21:05:56] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3053913 (10Krinkle) >>! In T158782#3050674, @Gehel wrote: > For reference, the Apache configuration backing the portal... [21:06:50] Pchelolo: heh, only thing i did was restart elasticsearch on logstash1002 [21:07:07] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:07:21] I would guess at 20:52 UTC ? [21:07:32] (03CR) 10Krinkle: "The redirect goes directly from www.wikipedia.org/portal/x to en.wikipedia.org/w/404.php. So it seems something is first rewriting the url" [puppet] - 10https://gerrit.wikimedia.org/r/339657 (https://phabricator.wikimedia.org/T158782) (owner: 10Gehel) [21:07:39] that was weird.. Thank you ebernhardson for help [21:07:41] i !log'd restarting elasticsearch here at 12:50, which was before i ssh'd in and restarted ... so 12:52 is a pretty good guess [21:08:06] that's even more suspicious then though :S [21:08:20] because we had no warnings or anything from elasticsearch, but restarting it magically fixed something [21:08:26] :/ [21:09:46] (03PS1) 10Dzahn: install: don't use http install method for bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339698 [21:11:17] (03PS2) 10Dzahn: install: don't use http install method for bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339698 [21:12:20] (03PS3) 10Dzahn: install: don't use http install method for bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339698 (https://phabricator.wikimedia.org/T156506) [21:12:55] (03PS4) 10Dzahn: install: don't use http install method for bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339698 (https://phabricator.wikimedia.org/T156506) [21:15:13] (03CR) 10Dzahn: [C: 032] install: don't use http install method for bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/339698 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:16:44] (03CR) 10Dzahn: "I do see that https://apt.wikimedia.org/tftpboot/jessie-installer/ is available and both files are there, so not sure yet what is going w" [puppet] - 10https://gerrit.wikimedia.org/r/339698 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:28:32] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3054048 (10Krinkle) >>! 
In T123728#3049049, @fgiunchedi wrote: > @Krinkle coal is fixed (the web part wasn't working but data collection was) and it broke as part of moving graphite1001 to jessie. No... [21:36:07] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:43:05] (03PS1) 10Andrew Bogott: labtest: Update 'admin' project id [puppet] - 10https://gerrit.wikimedia.org/r/339757 [21:47:00] (03CR) 10Andrew Bogott: [C: 032] labtest: Update 'admin' project id [puppet] - 10https://gerrit.wikimedia.org/r/339757 (owner: 10Andrew Bogott) [21:50:37] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [21:55:19] downtime for wdqs1001 extended, cc SMalyshev [22:01:01] 06Operations, 06Discovery, 06Services (watching), 15User-mobrovac: Set up Logstash behind LVS - https://phabricator.wikimedia.org/T159004#3054130 (10mobrovac) [22:01:50] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3054146 (10chasemp) [this is based on me talking to @ArielGlenn about this task and us trying to work things out -- hopefully I am summarizing well] **Dumps User classes:** #... [22:02:43] 06Operations, 06Discovery, 06Services (watching), 15User-mobrovac: Set up Logstash behind LVS - https://phabricator.wikimedia.org/T159004#3054130 (10demon) The elasticsearch part is probably trivial (we already do it for search), but Logstash itself might not be. There's a couple of ingestion routes and th... [22:04:27] 06Operations, 06Discovery, 06Services (watching), 15User-mobrovac: Set up Logstash behind LVS - https://phabricator.wikimedia.org/T159004#3054162 (10EBernhardson) One potential problem (or maybe not, i don't entirely know how LVS works) is that GELF can be sent as compressed, chunked UDP packets. If there... 
[22:16:34] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3054172 (10chasemp) The above may be too terse but I hope it makes sense, and one of the assumptions therein is we continue the dumps (ariel) and labs (wmcs) partnership here. We... [22:19:57] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1805.819759 Seconds [22:20:25] (03Draft1) 10Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - 10https://gerrit.wikimedia.org/r/339786 [22:20:30] (03PS2) 10Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - 10https://gerrit.wikimedia.org/r/339786 [22:20:58] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:20:58] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 21.188625 Seconds [22:21:40] (03PS3) 10Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - 10https://gerrit.wikimedia.org/r/339786 [22:23:33] !log smalyshev@tin Started deploy [wdqs/wdqs@62354ed]: Deploy new updater on 2001 for testing [22:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:59] !log smalyshev@tin Finished deploy [wdqs/wdqs@62354ed]: Deploy new updater on 2001 for testing (duration: 00m 26s) [22:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:57] !log smalyshev@tin Started deploy [wdqs/wdqs@62354ed]: Deploy new updater on 1001 for timeout increase [22:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:13] !log smalyshev@tin Finished deploy [wdqs/wdqs@62354ed]: Deploy new updater on 1001 for timeout increase (duration: 00m 16s) [22:28:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:26] SMalyshev: ? what's up? Why the Friday deploy? [22:30:17] greg-g: it's maintenance instance [22:30:30] not in production right now, debugging some timeout issues [22:30:34] gotcha [22:30:37] thanks :) [22:30:46] I didn't touch the production ones [22:32:29] (03Draft1) 10Paladox: Phabricator: Migrate to base::service_unit for systemd [puppet] - 10https://gerrit.wikimedia.org/r/339763 [22:32:33] (03PS2) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 [22:32:39] (03PS3) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 [22:33:34] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (owner: 10Paladox) [22:34:01] (03PS4) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 [22:37:10] (03PS2) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [22:37:58] (03PS3) 10Rush: nova: monitor for fullstack test daemon [puppet] - 10https://gerrit.wikimedia.org/r/339651 [22:43:47] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:46:28] !log (terbium) sql --write mediawikiwiki 'DELETE FROM module_deps' (in batches of 500; 42292 rows affected) - per T158105. 
[22:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:33] T158105: "PHP Warning: filemtime(): No such file or directory" about files removed over a year ago - https://phabricator.wikimedia.org/T158105 [22:49:57] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [22:54:57] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:08:36] (03PS3) 10Krinkle: webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 [23:08:45] (03PS4) 10Krinkle: webperf: Remove unused deprecate.py [puppet] - 10https://gerrit.wikimedia.org/r/338929 [23:11:47] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:23:57] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:32:47] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues