[00:01:09] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [00:06:49] (03PS3) 10Andrew Bogott: Remove a couple of unused settings from ldap config [puppet] - 10https://gerrit.wikimedia.org/r/288536 [00:06:51] (03PS1) 10Andrew Bogott: Define labsldapconfig for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/288555 [00:07:59] ebernhardson: oh I didn't see your confirmation, thanks for testing. [00:08:41] (03CR) 10Andrew Bogott: [C: 032] Define labsldapconfig for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/288555 (owner: 10Andrew Bogott) [00:08:54] so SWAT done at 23:58 UTC, we were in time :) [00:11:24] :) [00:11:32] !log catrope@tin Synchronized php-1.27.0-wmf.23/extensions/Echo/includes/api/ApiEchoNotifications.php: attempt to fix notices/fatals in Echo (duration: 00m 25s) [00:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:03] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/includes/api/ApiEchoNotifications.php: attempt to fix notices/fatals in Echo (duration: 00m 29s) [00:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:08] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.424 second response time [00:18:08] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.424 second response time [00:18:09] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.256 second response time [00:18:09] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.256 second response time [00:19:08] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 
bytes in 0.079 second response time [00:19:08] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.079 second response time [00:23:08] ^was legit broke and is now being fixed :) [00:27:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:27:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:29:31] (03PS1) 10Chad: Move remaining wikis to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288557 [00:30:27] (03CR) 10Chad: [C: 032] Move remaining wikis to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288557 (owner: 10Chad) [00:31:12] (03Merged) 10jenkins-bot: Move remaining wikis to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288557 (owner: 10Chad) [00:31:59] (03PS3) 10Dzahn: grafana: move files from ./files/ to module [puppet] - 10https://gerrit.wikimedia.org/r/288516 [00:32:36] (03PS4) 10Dzahn: grafana: move files from ./files/ to module [puppet] - 10https://gerrit.wikimedia.org/r/288516 [00:32:48] (03CR) 10Dzahn: [C: 032] grafana: move files from ./files/ to module [puppet] - 10https://gerrit.wikimedia.org/r/288516 (owner: 10Dzahn) [00:34:21] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: Remaining wikis to wmf.1 [00:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:35:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:36:21] (03CR) 10Dzahn: "checked on krypton. 
no-op" [puppet] - 10https://gerrit.wikimedia.org/r/288516 (owner: 10Dzahn) [00:38:21] I'm using mediawiki 1.28 wmf 1 but when I go to /mw-config/ I'm shown config-title [00:38:34] It isn't translated in /mw-config/ [00:38:38] Doesn't show English. [00:38:38] PROBLEM - MariaDB Slave SQL: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2033-bin.000012, end_log_pos 558628604 [00:38:39] PROBLEM - MariaDB Slave SQL: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2033-bin.000012, end_log_pos 558628604 [00:39:05] ostriches ^^ [00:39:21] mw-config? We don't use that in production. [00:39:47] ostriches oh, I use it on my own website https://en.random-wikisaur.tk/mw-config/ [00:40:05] Oh, well file a bug then? [00:40:28] ostriches: Ok [00:43:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.10 seconds [00:43:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.10 seconds [00:50:39] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [00:50:40] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[00:51:11] why do we keep getting double icinga-wm [00:51:21] not usually a problem like that [00:52:09] ostriches: I found the cause, it was https://github.com/wikimedia/mediawiki/commit/f7dad57c64db3eb1296894c2d3ae97b9f7f27c4c#diff-268cf4a97e2c7483271ee0e4356afe8d that caused it [00:52:13] Should i revert it. [00:53:04] !log killing multiple icinga-wm processes on neon, running puppet. manually started and by puppet as well? [00:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:31] (03PS4) 10Andrew Bogott: Remove a couple of unused settings from ldap config [puppet] - 10https://gerrit.wikimedia.org/r/288536 [01:04:33] (03PS1) 10Andrew Bogott: Add labtest-specific ldap config settings [puppet] - 10https://gerrit.wikimedia.org/r/288560 [01:09:19] (03PS1) 10BBlack: common VCL: clear XCP if not from local nginx [puppet] - 10https://gerrit.wikimedia.org/r/288561 [01:20:18] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [01:20:59] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: puppet fail [01:21:56] (03CR) 10BBlack: [C: 032] common VCL: clear XCP if not from local nginx [puppet] - 10https://gerrit.wikimedia.org/r/288561 (owner: 10BBlack) [01:35:39] (03PS1) 10Yuvipanda: tools: Enable the Registry Enforcer [puppet] - 10https://gerrit.wikimedia.org/r/288563 (https://phabricator.wikimedia.org/T133515) [01:35:59] (03PS2) 10Yuvipanda: tools: Enable the Registry Enforcer [puppet] - 10https://gerrit.wikimedia.org/r/288563 (https://phabricator.wikimedia.org/T133515) [01:38:46] (03PS3) 10Yuvipanda: tools: Enable the Registry Enforcer [puppet] - 10https://gerrit.wikimedia.org/r/288563 (https://phabricator.wikimedia.org/T133515) [01:39:01] (03PS4) 10Yuvipanda: tools: Enable the Registry Enforcer [puppet] - 10https://gerrit.wikimedia.org/r/288563 (https://phabricator.wikimedia.org/T133515) [01:46:29] (03CR) 10Yuvipanda: [C: 032] tools: Enable the Registry 
Enforcer [puppet] - 10https://gerrit.wikimedia.org/r/288563 (https://phabricator.wikimedia.org/T133515) (owner: 10Yuvipanda) [01:47:09] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:50:19] (03PS1) 10Yuvipanda: labstore: Enable dumps on wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/288566 (https://phabricator.wikimedia.org/T135205) [01:50:32] (03PS2) 10Yuvipanda: labstore: Enable dumps on wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/288566 (https://phabricator.wikimedia.org/T135205) [01:50:39] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Enable dumps on wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/288566 (https://phabricator.wikimedia.org/T135205) (owner: 10Yuvipanda) [01:58:31] grrrit-wm: say something! [01:58:57] (03Abandoned) 10Yuvipanda: [WIP] postgres: Provision credentials for all users / services [puppet] - 10https://gerrit.wikimedia.org/r/211091 (owner: 10Yuvipanda) [01:59:01] good [02:09:18] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: puppet fail [02:15:29] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[02:22:00] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 08m 17s) [02:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 13 02:30:49 UTC 2016 (duration 8m 50s) [02:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:50] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:39:38] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [03:04:39] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail [03:07:08] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:32:09] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [05:27:40] !log reimporting x1 on dbstore2002 [05:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:35:59] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.45 seconds [05:36:28] PROBLEM - MariaDB Slave Lag: s5 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.13 seconds [05:36:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.10 seconds [05:37:18] PROBLEM - MariaDB Slave Lag: s7 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 689.59 seconds [05:37:19] PROBLEM - MariaDB Slave Lag: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 694.27 seconds [05:39:49] RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [05:40:28] RECOVERY - MariaDB Slave 
Lag: s5 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 282.20 seconds [05:40:29] RECOVERY - MariaDB Slave SQL: x1 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:40:49] RECOVERY - MariaDB Slave Lag: s1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [05:41:19] RECOVERY - MariaDB Slave Lag: s6 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [05:43:10] RECOVERY - MariaDB Slave Lag: s7 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [05:43:50] that was me stopping replication for the import [05:47:49] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 39.22 seconds [06:24:07] (03CR) 1020after4: "@alexandros: the usual semantics of a conf.d directory is completely different than the way userkeys/user.d/ is implemented. I was simply " [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [06:24:17] (03Abandoned) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [06:25:59] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291795 (10mobrovac) Considering that the RT testing infra has been running on Node 4 for quite a while now and the data you collected, I think this is ready to move on. > Given that... 
[06:27:39] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291799 (10mobrovac) [06:27:42] 06Operations, 10Parsoid, 06Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#2291800 (10mobrovac) [06:27:46] 06Operations, 06Services, 07Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2291798 (10mobrovac) [06:28:41] 06Operations, 10Parsoid, 06Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#1972341 (10mobrovac) [06:28:43] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10mobrovac) [06:30:39] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291805 (10mobrovac) Also see related tasks for Beta and CI: - {T125003} - {T126992} [06:30:59] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:59] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:49] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:59] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:18] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:19] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:29] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:19] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:14] (03CR) 10Mobrovac: [C: 04-1] "The Cassandra cluster name is missing." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [06:54:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [06:54:36] (03PS33) 1020after4: scap::target keyholder-managed ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [06:56:09] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:18] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:57:18] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:57:19] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:50] (03PS34) 1020after4: scap::target keyholder-managed ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [06:58:00] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:00] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] (03CR) 1020after4: "faidon: key generation has been removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [06:58:19] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:38] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:10] (03PS35) 1020after4: scap::target keyholder-managed ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [07:29:44] (03PS1) 10Mobrovac: Change Prop: Increase heap limit to 750 MB [puppet] - 10https://gerrit.wikimedia.org/r/288580 (https://phabricator.wikimedia.org/T134456) [07:35:36] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259756 (10Joe) While I appreciate some of the principles (but not all) put forward by @Dzahn and @JanZerebecki, I am used to reason in terms of reality and not in terms of principles: Mar... [07:36:06] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2291863 (10Joe) a:03Joe [07:36:25] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259756 (10Joe) @GWicke we still need your approval though. [07:43:39] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail [07:44:24] (03CR) 10Mobrovac: "PCC is OK with it - https://puppet-compiler.wmflabs.org/2789/" [puppet] - 10https://gerrit.wikimedia.org/r/288580 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [07:44:28] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291872 (10ema) >>! In T134989#2290254, @BBlack wrote: > 1. We've got 2x upstream bugfixes applied to our varnishd on cache_mi... 
[07:47:23] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2291873 (10Joe) The correct endpoint to use for mediawiki, restbase, etc are (in puppet terms) MW Api: http:... [07:52:48] (03PS3) 10Muehlenhoff: Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 [07:55:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Change Prop: Increase heap limit to 750 MB [puppet] - 10https://gerrit.wikimedia.org/r/288580 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [07:57:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 (owner: 10Muehlenhoff) [07:57:15] (03PS4) 10Muehlenhoff: Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 [07:57:22] (03CR) 10Muehlenhoff: [V: 032] Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 (owner: 10Muehlenhoff) [07:58:58] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 15 of table commonswiki.filearchive cannot be converted from type varchar(100) to type varbinary(32) [07:59:45] doing [08:03:08] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:09:49] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:53:30] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291971 (10ema) Issue reproduced: https://phabricator.wikimedia.org/P3067 [09:09:17] (03PS1) 10Lokal Profil: Make sul icons square and use global defaults [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) [09:10:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 654 [09:11:06] (03CR) 10Lokal Profil: "The pngs are simply the Inkscape exports from the official logos. Likely they should be run through some png cleanup util. Happy to do so " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [09:11:56] !log start backfilling cassandra metrics from graphite1001 to graphite1003 [09:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:20:08] RECOVERY - check_mysql on lutetium is OK: Uptime: 216799 Threads: 1 Questions: 4292978 Slow queries: 11639 Opens: 52800 Flush tables: 2 Open tables: 64 Queries per second avg: 19.801 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:48:44] !log nodetool cleanup on restbase2006 T132976 [09:48:45] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [09:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:52:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] "This is already set on all Linux systems with >= 4.4 without problems, so merging now." 
[puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [09:52:02] (03PS2) 10Filippo Giunchedi: graphite: deprecate 'carbonctl check' [puppet] - 10https://gerrit.wikimedia.org/r/288161 [09:52:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: deprecate 'carbonctl check' [puppet] - 10https://gerrit.wikimedia.org/r/288161 (owner: 10Filippo Giunchedi) [09:52:14] (03PS8) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [09:52:25] (03CR) 10Muehlenhoff: [V: 032] Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [10:01:10] (03PS2) 10Muehlenhoff: Add ferm rules for role::mariadb::tendril [puppet] - 10https://gerrit.wikimedia.org/r/285946 [10:02:04] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for role::mariadb::tendril [puppet] - 10https://gerrit.wikimedia.org/r/285946 (owner: 10Muehlenhoff) [10:02:57] (03PS2) 10Muehlenhoff: Enable base::firewall on db1011 [puppet] - 10https://gerrit.wikimedia.org/r/285947 [10:03:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on db1011 [puppet] - 10https://gerrit.wikimedia.org/r/285947 (owner: 10Muehlenhoff) [10:14:43] 06Operations, 07Graphite, 13Patch-For-Review: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889#2292066 (10fgiunchedi) the two new machines are now live and hosting cassandra metrics, e.g. in eqiad at the time of the switch: {F4005006} {F4005009} [10:21:21] 06Operations, 07Graphite, 13Patch-For-Review: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889#2292088 (10fgiunchedi) left TODO: * cassandra metrics backfill ** graphite1001 -> graphite1003 ** graphite1003 -> graphite2002 * graphite browser at https://graphite.wikimedi... 
[10:23:10] 06Operations: Update firejail to 0.38 - https://phabricator.wikimedia.org/T121756#2292090 (10MoritzMuehlenhoff) [10:27:01] !log depooling cp1061 trying to reproduce T134989 [10:27:02] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [10:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:31:54] (03PS1) 10Giuseppe Lavagetto: mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) [10:33:05] (03CR) 10jenkins-bot: [V: 04-1] mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [10:34:06] <_joe_> mobrovac: ^^ [10:35:32] (03PS2) 10Giuseppe Lavagetto: mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) [10:36:54] (03CR) 10jenkins-bot: [V: 04-1] mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [10:39:08] (03PS3) 10Giuseppe Lavagetto: mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) [10:41:33] (03CR) 10Giuseppe Lavagetto: [C: 032] mobileapps: add experimental cluster check [puppet] - 10https://gerrit.wikimedia.org/r/288589 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [10:45:58] <_joe_> meh [10:47:02] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [10:48:08] (03PS1) 10Giuseppe Lavagetto: lvs::monitor: fix duplicate declaration [puppet] - 10https://gerrit.wikimedia.org/r/288591 [10:48:08] <_joe_> this is me ^^ [10:48:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] lvs::monitor: fix duplicate declaration [puppet] - 
10https://gerrit.wikimedia.org/r/288591 (owner: 10Giuseppe Lavagetto) [10:54:13] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292155 (10Jonas) Issue appearing again ``` curl -v 'https://query.wikidata.org/i18n/en.json' * Hostname was NOT found in DNS... [10:54:57] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:16] (03PS1) 10Giuseppe Lavagetto: lvs::monitor: actual, not virtual resources. [puppet] - 10https://gerrit.wikimedia.org/r/288592 [11:00:05] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs::monitor: actual, not virtual resources. [puppet] - 10https://gerrit.wikimedia.org/r/288592 (owner: 10Giuseppe Lavagetto) [11:04:11] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292181 (10ema) >>! In T134989#2292155, @Jonas wrote: > Issue appearing again [...] > < Age: 58 > < X-Cache: cp1045 hit+miss(... [11:09:33] !log uploaded firejail 0.38-1+wmf1 for trusty-wikimedia to carbon [11:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:28] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service, 13Patch-For-Review: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2292192 (10Joe) Mobileapps is now monitored via service_checker on the LVS IPs, and will send emails... 
[11:19:04] (03CR) 10Eranroz: [C: 031] Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [11:28:33] !log repooling cp1061 [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:54] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2292205 (10akosiaris) >>! In T135176#2291795, @mobrovac wrote: > Considering that the RT testing infra has been running on Node 4 for quite a while now and the data you collected, I t... [11:37:44] (03CR) 10Alexandros Kosiaris: "I thought so too, turns out that it is. But I 'll take a look into keyholder auth in order to implicitly add the ops group everywhere to a" [puppet] - 10https://gerrit.wikimedia.org/r/288425 (owner: 10Alexandros Kosiaris) [11:46:46] what's with the post-merge build failure in https://gerrit.wikimedia.org/r/#/c/288201/ ? [11:46:49] pplint [11:47:02] seems like someone forced through some other change that failed pplint? [12:12:55] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [12:15:07] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2292274 (10mobrovac) >>! In T135176#2292205, @akosiaris wrote: > When would you like this done? A proposal on my part is end of next week (week of 16th, say Thursday ?) for the first... 
[12:15:25] bblack: seems so, my patch valiated fine until PS7, I'll check the past merged commits [12:16:07] moritzm: yeah I don't think it's your patch at all, I think someone snuck something in [12:24:35] (03PS3) 10Muehlenhoff: Install firejail on image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/288379 (https://phabricator.wikimedia.org/T135111) [12:28:59] bblack: https://blogs.dropbox.com/tech/2016/05/enabling-http2-for-dropbox-web-services-experiences-and-observations/ [12:29:40] :) [12:39:52] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:12:18] 06Operations, 10ops-eqiad: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2292408 (10chasemp) a:03Cmjohnson tossing your way @cmjohnson as I'm guessing this is all you :) [13:13:56] !log depooling cp3007 to try reproducing T134989 [13:13:57] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [13:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:07] 06Operations, 10ops-eqiad: db1023 Degraded RAID - https://phabricator.wikimedia.org/T135157#2292411 (10chasemp) a:03chasemp @cmjohnson handing over to you so it is on your radar [13:22:16] (03PS1) 10Yuvipanda: [WIP] Add 'hostautomounter' admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/288600 [13:25:01] 06Operations, 10Traffic: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10fgiunchedi) [13:26:41] _joe_: ^ WIP of admission controller. 
Don't bother reviewing yet, haven't even written tests / has syntax issues that i'm fixing now [13:27:20] <_joe_> YuviPanda: in the meantime, I'm looking into deploying go binaries via scap3 [13:27:43] _joe_: ah, ok :) [13:27:44] (03PS1) 10Rush: update checker.tools.wmflabs.org toolscron endpoint [puppet] - 10https://gerrit.wikimedia.org/r/288601 [13:28:22] _joe_: I don't think setting up a new scap3 master is documented anywhere, would be nice if we can have one for tools [13:29:58] <_joe_> YuviPanda: nothing about scap3 is documented well atm [13:30:22] * YuviPanda nods [13:31:10] _joe_: long term things that need tackling but I am not even touching atm are scap3, DNS and Ingress. [13:31:32] <_joe_> YuviPanda: next in line would be DNS for me [13:31:47] nice [13:39:41] (03PS2) 10Rush: update checker.tools.wmflabs.org toolscron endpoint [puppet] - 10https://gerrit.wikimedia.org/r/288601 [13:41:26] PROBLEM - jenkins_zmq_publisher on gallium is CRITICAL: Connection refused [13:42:31] <_joe_> hashar: ^^ [13:42:37] <_joe_> something I can do? [13:42:40] oh [13:42:51] asked in -releng as well since alert popped up there [13:42:57] I didn't know we used zmq in CI [13:43:01] !log Restarted Jenkins, the zmq plugin daemon did not start [13:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:13] _joe_: all my fault / Jenkins fault [13:43:26] RECOVERY - jenkins_zmq_publisher on gallium is OK: TCP OK - 0.000 second response time on port 8888 [13:43:54] hmm [13:43:58] I did nothing really [13:44:15] !log Jenkins zmq is all fine [13:44:20] (03CR) 10Rush: [C: 032 V: 032] "second patch was just comments" [puppet] - 10https://gerrit.wikimedia.org/r/288601 (owner: 10Rush) [13:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:25] _joe_: my theory is the icinga check kicked in while Jenkins was rebooting. 
It is all fine right now [13:45:31] <_joe_> ok [13:46:03] chasemp: yup Jenkins exposes its event over zmq. Nodepool subscribe to the feed to notice when a job is complete and garbage collect the disposable instances [13:46:20] 06Operations, 10Traffic: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292474 (10fgiunchedi) more data from a captured tcpdump, see also the big udp packet size ```lines=5 13:44:08.922449 IP cp4002.ulsfo.wmnet.9705 > graphite1001.eqiad.wmnet.8125: UDP, le... [13:46:23] ah it's nodepool gotcha [13:49:18] 06Operations, 10Citoid, 10Graphoid, 06Services, 10service-template-node: SCA services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2292480 (10mobrovac) [13:49:23] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292479 (10mobrovac) [13:51:15] (03PS1) 10Rush: icinga tools.checker Tools paging update [puppet] - 10https://gerrit.wikimedia.org/r/288603 [13:52:15] 06Operations, 10Citoid, 10Graphoid, 06Services, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2292481 (10mobrovac) [13:52:49] (03PS1) 10Ottomata: Fix default replication factor in clusters with less than 3 brokers (labs) [puppet] - 10https://gerrit.wikimedia.org/r/288604 (https://phabricator.wikimedia.org/T121562) [13:53:05] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292488 (10mobrovac) I'm currently working on {T97530} which will allow you to easily switch to using interna... 
[13:53:32] (03PS1) 10BBlack: fix varnishxcps4 junk output T135227 [puppet] - 10https://gerrit.wikimedia.org/r/288605 [13:53:44] 06Operations, 10Traffic, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10Krinkle) > varnish.clients.d.42387 t=1463146627656326:1|c Looks like these don't belong, either. Collecting it in some way could be useful (T131894) - b... [13:53:55] (03CR) 10Yuvipanda: [C: 031] icinga tools.checker Tools paging update [puppet] - 10https://gerrit.wikimedia.org/r/288603 (owner: 10Rush) [13:54:13] (03CR) 10BBlack: [C: 032 V: 032] fix varnishxcps4 junk output T135227 [puppet] - 10https://gerrit.wikimedia.org/r/288605 (owner: 10BBlack) [13:55:59] (03PS2) 10Ottomata: Fix default replication factor in clusters with less than 3 brokers (labs) [puppet] - 10https://gerrit.wikimedia.org/r/288604 (https://phabricator.wikimedia.org/T121562) [13:56:17] (03CR) 10Ottomata: [C: 032 V: 032] Fix default replication factor in clusters with less than 3 brokers (labs) [puppet] - 10https://gerrit.wikimedia.org/r/288604 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [13:57:43] (03PS4) 10Elukey: Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) [13:57:57] !log installing and testing sqlproxy on dbproxy1005 [13:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:58] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292558 (10Yurik) >>! In T134241#2291873, @Joe wrote: > The correct endpoint to use for mediawiki, restbase,... 
[14:01:10] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292563 (10mobrovac) >>! In T134241#2292558, @Yurik wrote: > Joe, should I manually specify the `Host` value... [14:04:50] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292578 (10Yurik) @mobrovac thx, but `mwApiGet` implies mw `api.php`. What about thumbnails from commons, pag... [14:12:28] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2292602 (10elukey) === Summary === * memcached 1.4.25 limits the number of slab classes to 63. In our use case, this means that the 62nd class for us would be... [14:15:46] 06Operations, 10ops-eqiad: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2292645 (10Cmjohnson) @jgreen, We will need to scheduled down time to replace the disk. Also, please make sure grub is on /dev/sdb [14:17:18] (03PS1) 10Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 [14:23:41] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292672 (10Dzahn) [14:23:55] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10Arlolra) > For what is worth, and given binary node_modules between nodejs versions not being compatible, during the migration period, no parsoid deploys should happen. Cor... 
[14:24:58] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292678 (10mobrovac) >>! In T134241#2292578, @Yurik wrote: > @mobrovac thx, but `mwApiGet` implies mw `api.ph... [14:29:17] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2292698 (10ssastry) What @arlolra said and @akosiaris's proposed timeline looks good to me. [14:31:44] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2292703 (10Ottomata) We didn't get a chance to fully restart each broker with `inter.broker.protocol.version=0.9.0.X` this w... [14:34:15] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2292709 (10Yurik) Shouldn't we have some "automagic mapper" that changes any `https://publichost/` into a `ht... [14:38:22] (03PS1) 10Alexandros Kosiaris: Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 [14:39:35] (03CR) 10jenkins-bot: [V: 04-1] Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 (owner: 10Alexandros Kosiaris) [14:40:39] Krenair: hey, Do you have time to work on the LDAP for graphite? [14:40:51] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2292730 (10Nikki) In the description, it says the project namespace should be called "Wikipidia", but on jam.wikipedia.org it seems to still be... 
[14:41:07] (03CR) 10Yuvipanda: Introduce service::uwsgi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288613 (owner: 10Alexandros Kosiaris) [14:41:33] (or direct me to a manual to do it and I make patches/etc.) [14:41:45] (03CR) 10Andrew Bogott: [C: 04-1] icinga tools.checker Tools paging update (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288603 (owner: 10Rush) [14:42:37] (03PS2) 10Rush: icinga tools.checker Tools paging update [puppet] - 10https://gerrit.wikimedia.org/r/288603 [14:42:47] (03PS2) 10Alexandros Kosiaris: Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 [14:43:17] (03CR) 10Andrew Bogott: [C: 031] icinga tools.checker Tools paging update [puppet] - 10https://gerrit.wikimedia.org/r/288603 (owner: 10Rush) [14:43:55] (03CR) 10jenkins-bot: [V: 04-1] Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 (owner: 10Alexandros Kosiaris) [14:44:21] (03PS2) 10Andrew Bogott: Add labtest-specific ldap config settings [puppet] - 10https://gerrit.wikimedia.org/r/288560 [14:45:46] (03PS1) 10Andrew Bogott: Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615 [14:45:59] (03CR) 10Andrew Bogott: [C: 032] Add labtest-specific ldap config settings [puppet] - 10https://gerrit.wikimedia.org/r/288560 (owner: 10Andrew Bogott) [14:46:36] (03CR) 10Andrew Bogott: [C: 032 V: 032] Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott) [14:47:48] (03CR) 10Rush: "thank you for continuing the lt- convention :)" [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott) [14:50:31] (03PS1) 10Ladsgroup: grafana: give access to "wikidev" LDAP memebers [puppet] - 10https://gerrit.wikimedia.org/r/288616 [14:52:36] (03CR) 10Ladsgroup: "Related discussion: https://phabricator.wikimedia.org/T134651" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [14:57:32] gehel: WDQS 
alert? [14:57:40] (UNKNOWN for 2 days 2 hours) [14:57:49] godog: graphite1001 alerts too [14:58:19] andrewbogott, chasemp: labcontrol1002 ~2d disk WARNING too [14:58:35] should I just /nick icinga-wm-2? :) [14:58:47] huh — that's an interesting one [14:58:56] heh thank you for the heads up paravoid [14:59:23] heheh looking paravoid [14:59:24] chasemp: that should have been fixed by —delete-after, I'm looking now [14:59:38] andrewbogott: it's all glance I think [14:59:40] kk [14:59:42] yup [15:00:14] (03PS1) 10Alexandros Kosiaris: ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 [15:01:25] (03CR) 10jenkins-bot: [V: 04-1] ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [15:05:14] RECOVERY - check_raid on beryllium is OK: OK: LinuxRAID /dev/md/0: act=1, wk=1, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=0, sp=0: /dev/md/3: act=1, wk=1, fail=0, sp=0 [15:05:44] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (10debt) Hi! Checking on the progress of this ticket... We're waiting for this to be looked at / fixed before we can update the wikipedia.org portal st... [15:05:46] 06Operations, 10ops-eqiad: db1023 Degraded RAID - https://phabricator.wikimedia.org/T135157#2292788 (10chasemp) a:05chasemp>03Cmjohnson [15:06:09] (03CR) 10Krinkle: [C: 04-1] "This same pattern (without wikidev) is also used for graphite and kibana. Not sure it should differ here." 
[puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [15:06:12] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292790 (10debt) [15:07:32] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2292794 (10Jgreen) >>! In T135178#2292645, @Cmjohnson wrote: > @jgreen, We will need to scheduled down time to replace the disk. Also, > please make sure gru... [15:08:34] (03PS1) 10Yuvipanda: tools: Add python-requests-oauthlib package [puppet] - 10https://gerrit.wikimedia.org/r/288619 (https://phabricator.wikimedia.org/T130529) [15:08:56] (03PS2) 10Yuvipanda: tools: Add python-requests-oauthlib package [puppet] - 10https://gerrit.wikimedia.org/r/288619 (https://phabricator.wikimedia.org/T130529) [15:09:08] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add python-requests-oauthlib package [puppet] - 10https://gerrit.wikimedia.org/r/288619 (https://phabricator.wikimedia.org/T130529) (owner: 10Yuvipanda) [15:10:34] (03PS1) 10Hashar: (wip) Add puppet-lint to Rakefile (wip) [puppet] - 10https://gerrit.wikimedia.org/r/288620 [15:12:14] paravoid: sorry, was afk [15:12:51] (03CR) 10Ladsgroup: "So do you think we should give access to wikidev for graphtie and kibana too? or Do you think we shouldn't let non-NDA people get access t" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [15:13:05] paravoid: are you refering to the zero sized body issue? Or something else? [15:13:15] to an oustanding icinga alert [15:13:38] * gehel is checking... [15:16:50] looks like a flaky graphite_check... service looks fine... 
[15:20:06] (03PS1) 10Alexandros Kosiaris: scap3: add ops into keyholder trusted groups [puppet] - 10https://gerrit.wikimedia.org/r/288624 [15:20:20] (03PS3) 10Alexandros Kosiaris: phab::vcs: Add a proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/287081 [15:20:26] (03PS2) 10Rush: tool labs bastions tcl cgroup [puppet] - 10https://gerrit.wikimedia.org/r/288622 (https://phabricator.wikimedia.org/T131541) [15:21:44] (03Abandoned) 10Alexandros Kosiaris: deploy-service: add akosiaris [puppet] - 10https://gerrit.wikimedia.org/r/288425 (owner: 10Alexandros Kosiaris) [15:22:47] (03PS4) 10Alexandros Kosiaris: phab::vcs: Add a proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/287081 [15:32:20] (03PS2) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [15:32:39] (03CR) 10Alexandros Kosiaris: [C: 032] phab::vcs: Add a proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/287081 (owner: 10Alexandros Kosiaris) [15:36:56] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2292869 (10GWicke) Approved. Also, I think @mobrovac is the only one in #services who even has shell on sc* right now. That's not a sustainable situation, but lets address that separately. 
[15:37:23] (03PS3) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [15:41:14] !log bounce graphite-web on graphite1001, debugging carbon-c-relay stalls while sending metrics [15:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:41] (03PS4) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [15:45:49] (03PS5) 10Ottomata: Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [15:46:45] (03PS3) 10Rush: tool labs bastions tcl cgroup [puppet] - 10https://gerrit.wikimedia.org/r/288622 (https://phabricator.wikimedia.org/T131541) [15:46:48] (03CR) 10jenkins-bot: [V: 04-1] Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [15:46:55] (03CR) 10Rush: [C: 032 V: 032] tool labs bastions tcl cgroup [puppet] - 10https://gerrit.wikimedia.org/r/288622 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:47:38] (03PS2) 10Yuvipanda: [WIP] Add 'hostautomounter' admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/288600 [15:47:42] (03PS1) 10Elukey: Remove aqs1006 partman configuration to test why boot is failing after os install. [puppet] - 10https://gerrit.wikimedia.org/r/288626 (https://phabricator.wikimedia.org/T133785) [15:48:39] (03PS2) 10Elukey: Remove aqs1006 partman configuration to test why boot is failing after os install. [puppet] - 10https://gerrit.wikimedia.org/r/288626 (https://phabricator.wikimedia.org/T133785) [15:48:47] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/2793/tin.eqiad.wmnet/ pcc is happy and change does what is expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/288624 (owner: 10Alexandros Kosiaris) [15:49:37] (03CR) 10Halfak: "If login is just an anti-vandalism measure, then it seems like it makes sense to open this to everyone with labs account. If there is som" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [15:49:38] _joe_: ^ needs a bit more docs but all tests pass now \o/ [15:49:39] (03CR) 10Elukey: [C: 032 V: 032] Remove aqs1006 partman configuration to test why boot is failing after os install. [puppet] - 10https://gerrit.wikimedia.org/r/288626 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [15:49:51] * YuviPanda is going brb [15:50:49] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292902 (10Krenair) >>! In T135029#2292786, @debt wrote: > Checking on the progress of this ticket... task status is stalled, we're waiting for ops [15:51:56] 06Operations: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2292904 (10Krenair) This is nothing to do with Differential, Diffusion, Audit or Phabricator [15:53:39] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2292923 (10greg) [15:55:13] (03CR) 10Alex Monk: "This was already discussed in T134651#2272511 - the ACL *should* be different from the others, which hold sensitive data, here, because th" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [15:57:37] (03PS5) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [15:58:32] (03CR) 10Halfak: "This is a fair point, but it is already true via JSONP anyway. 
In order for someone to operate on your behalf they'd need to get you to c" [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [15:59:44] paravoid: I'm looking at the failing icinga check for WDQS. There something wrong with the check... not sure what yet... [16:00:02] (03CR) 10Ladsgroup: "hmmm, So should we have another user group or add more people (wikidev in the beta cluster) to this LDAP group?" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [16:00:21] (03PS6) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [16:02:05] check_graphite_threshold seems to get data from something similar to "curl -v https://graphite.wikimedia.org/render\?format=json\&from=-5min\&target=varnish.eqiad.backends.be_wdqs1001.GET.p99", which fails... [16:04:21] seems that "from" should actually be "_from"... not sure why some of the similar checks pass... [16:06:25] gehel: I think I know what's going on, possibly related to having additional graphite machines, though the graphite version isn't the same, 0.9.12 vs 0.9.13 [16:06:28] (03CR) 10Alexandros Kosiaris: "changed the model a bit and made trusted_group defaulting to [] and being forced into an array. pcc is happy again after some minor change" [puppet] - 10https://gerrit.wikimedia.org/r/288624 (owner: 10Alexandros Kosiaris) [16:06:44] (03PS1) 10Alex Monk: Set meta namespace names for jamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288628 (https://phabricator.wikimedia.org/T134017) [16:07:15] godog: Thanks! I was really puzzled by this one... [16:07:40] gehel: heh, me too, trying to confirm better what's going on [16:08:08] godog: You seem to understand this better than I do, but let me know if I can help... 
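[editor's sketch] The request gehel describes above (a Graphite render call with `format=json`, a `from` window and a `target`) can be illustrated as follows. This is a minimal sketch, not the code of the real check_graphite_threshold plugin, which this log does not show. The helper names `render_url` and `percent_over` are hypothetical, and the "percent of datapoints above a threshold" logic is only inferred from the wording of the alerts in this channel. It assumes the JSON shape that the Graphite render API returns: a list of series, each with `datapoints` as `[value, timestamp]` pairs, where `value` may be null.

```python
from urllib.parse import urlencode


def render_url(base, target, window="-5min"):
    """Build a Graphite render API URL that asks for JSON for one target.

    `base`, `target` and `window` are caller-supplied; nothing is fetched here.
    """
    query = urlencode({"format": "json", "from": window, "target": target})
    return f"{base}/render?{query}"


def percent_over(series, threshold):
    """Return the percentage of non-null datapoints strictly above `threshold`.

    `series` is the decoded JSON list from a render call. Returns None when
    there are no usable datapoints, so the caller can report "no data"
    instead of a misleading 0%.
    """
    values = [
        value
        for one_series in series
        for value, _timestamp in one_series["datapoints"]
        if value is not None
    ]
    if not values:
        return None
    above = sum(1 for value in values if value > threshold)
    return 100.0 * above / len(values)


if __name__ == "__main__":
    print(render_url("https://graphite.wikimedia.org",
                     "varnish.eqiad.backends.be_wdqs1001.GET.p99"))
    sample = [{"target": "example", "datapoints": [[1.0, 1], [None, 2], [3.0, 3], [5.0, 4]]}]
    print(percent_over(sample, 2.0))
```

Returning None for an empty or all-null response keeps "no data" distinct from "0% above threshold". That is one plausible reason a check could sit in UNKNOWN rather than OK or CRITICAL, though the log does not confirm that this is what happened here.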
[16:08:09] (03PS7) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [16:08:14] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline; plus, I think there are other trusted_group references across the tree." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288624 (owner: 10Alexandros Kosiaris) [16:08:30] gehel: for sure, thanks! [16:13:06] (03CR) 10Krinkle: "Both in case of Kibana, Graphite and Grafana 'access' in the puppet file refers to write access. Which to a first approximation has no his" [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [16:14:01] 06Operations: Update firejail to 0.38 - https://phabricator.wikimedia.org/T121756#2292948 (10MoritzMuehlenhoff) The version for trusty has been updated to 0.38 (for upcoming use in scalers, but the sca cluster still needs to be upgraded) [16:16:12] (03CR) 10Faidon Liambotis: [C: 04-1] "This looks good -- and thanks for the extensive documentation!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [16:17:17] paravoid: -1 looks good? 
:) [16:17:27] there's more :) [16:17:35] ok :) [16:17:51] two tiny inline comments, also jenkins V-2ed it [16:23:05] 06Operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2292978 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete [16:25:45] (03PS1) 10Hashar: (wip) Speed up linting by only process HEAD (wip) [puppet] - 10https://gerrit.wikimedia.org/r/288629 [16:26:25] (03PS1) 10Thcipriani: Clean old scap code [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) [16:29:47] 06Operations, 10ops-codfw: ms-be2008.codfw.wmnet: slot=1 dev=sdl failed - https://phabricator.wikimedia.org/T131147#2292993 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete [16:30:22] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2292997 (10chasemp) >>! In T134798#2283060, @yuvipanda wrote: > We only need toolserver.org and www.toolserver.org I think. >>! In T134798#2283509, @Dzahn wrote: > So an existing... 
[16:31:56] PROBLEM - Disk space on ms-be2008 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible: Input/output error [16:32:07] (03PS8) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/288624 [16:32:35] (03CR) 10Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288624 (owner: 10Alexandros Kosiaris) [16:35:24] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:15:09] (03PS1) 10Mobrovac: Allow the output to be in YAML format [software/conftool] - 10https://gerrit.wikimedia.org/r/288632 [17:16:35] (03CR) 10jenkins-bot: [V: 04-1] Allow the output to be in YAML format [software/conftool] - 10https://gerrit.wikimedia.org/r/288632 (owner: 10Mobrovac) [17:16:46] damn [17:19:14] (03PS2) 10Mobrovac: Allow the output to be in YAML format [software/conftool] - 10https://gerrit.wikimedia.org/r/288632 [17:32:35] RECOVERY - Disk space on ms-be2008 is OK: DISK OK [17:38:34] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: Puppet has 1 failures [17:43:38] !log powercycle ms-be2008, sdb failed [17:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:20] (03CR) 1020after4: [C: 031] "yes please" [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) (owner: 10Thcipriani) [17:56:00] (03PS2) 10Thcipriani: Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 [17:56:40] (03CR) 10Thcipriani: [C: 031] "Now that scap 3.2.0 is in production, this can merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [17:56:45] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: puppet fail [17:58:06] (03PS3) 10EBernhardson: Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 [18:00:51] (03PS1) 10Cmjohnson: Fixing typo for elastic1039 dhcpd file entry missing characters in mac [puppet] - 10https://gerrit.wikimedia.org/r/288638 [18:02:15] (03CR) 10Cmjohnson: [C: 032] Fixing typo for elastic1039 dhcpd file entry missing characters in mac [puppet] - 10https://gerrit.wikimedia.org/r/288638 (owner: 10Cmjohnson) [18:03:46] RECOVERY - puppet last run on mw2210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:51] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2293261 (10Cmjohnson) p:05Triage>03Low [18:08:15] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2293262 (10Cmjohnson) [18:08:55] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:04] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:24] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:10:20] 06Operations, 10ops-eqiad: Investigate cp1008 psu status - https://phabricator.wikimedia.org/T134888#2293267 (10Cmjohnson) 05Open>03declined this server is decom'd [18:11:43] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2293269 (10Cmjohnson) [18:13:12] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#2293270 (10egalvezwmf) [18:14:05] PROBLEM - confd service on cp3007 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [18:15:14] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:15:25] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:19:14] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:23:55] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:27:52] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2293302 (10Cmjohnson) [18:32:29] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661270 (10JanZerebecki) Changing your key in this task doesn't do anything. Either upload a patch that changes it in operations/puppet.git/modules/admin/data/dat... [18:32:32] 06Operations, 10ops-eqiad: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2293311 (10Cmjohnson) [18:35:56] remember webmin? 
it's still a thing and they have "cloudmin" now , a UI _on top of webmin_ :p [18:43:55] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [106250000.0] [18:45:16] ^looking [18:47:00] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2292710 (10Dzahn) confirmed the trusty file works: https://releases.wikimedia.org/debian/dists/... [18:48:35] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2293402 (10Dzahn) don't see error in apache log either on the backend (bromine). i think we must... [18:49:01] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2293403 (10Dzahn) P.S. still no reason that the files are owned by "5038" [19:08:24] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [93750000.0] [19:26:31] ok, time to fight with aqs1006 for a bit. [19:46:29] (03PS1) 10Mattflaschen: Disable cross-wiki notif. (both Beta and separate pref.) on non-SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288651 (https://phabricator.wikimedia.org/T135246) [19:48:54] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: puppet fail [19:49:48] (03CR) 10Catrope: [C: 031] Disable cross-wiki notif. (both Beta and separate pref.) on non-SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288651 (https://phabricator.wikimedia.org/T135246) (owner: 10Mattflaschen) [19:52:39] (03CR) 10Sbisson: [C: 031] Disable cross-wiki notif. (both Beta and separate pref.) 
on non-SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288651 (https://phabricator.wikimedia.org/T135246) (owner: 10Mattflaschen) [19:52:52] ostriches, can I deploy https://gerrit.wikimedia.org/r/288651 (cross-wiki notification hotfix)? [19:57:28] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2293607 (10RobH) So I disabled the second controller port and it boots into the OS. It seems the OS installs onto one of the ports, but the other port is conflicting... [19:57:45] matt_flaschen: fine by me [19:59:34] ostriches, thanks. https://gerrit.wikimedia.org/r/288654 too ? [19:59:55] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp1051 is CRITICAL: Connection refused [20:00:15] PROBLEM - Varnish HTTP misc-frontend - port 80 on cp1051 is CRITICAL: Connection refused [20:01:20] (03PS1) 10BBlack: cache_misc: downgrade almost all to varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/288655 (https://phabricator.wikimedia.org/T134989) [20:01:25] PROBLEM - Varnishkafka log producer on cp1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:01:32] cp1051 is me, sorry [20:01:37] rest won't be so noisy :) [20:01:46] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: downgrade almost all to varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/288655 (https://phabricator.wikimedia.org/T134989) (owner: 10BBlack) [20:02:07] (03CR) 10Mattflaschen: [C: 032] Disable cross-wiki notif. (both Beta and separate pref.) on non-SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288651 (https://phabricator.wikimedia.org/T135246) (owner: 10Mattflaschen) [20:02:43] (03Merged) 10jenkins-bot: Disable cross-wiki notif. (both Beta and separate pref.) 
on non-SUL [mediawiki-config] - https://gerrit.wikimedia.org/r/288651 (https://phabricator.wikimedia.org/T135246) (owner: Mattflaschen)
[20:04:35] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: puppet fail
[20:05:52] !log starting varnish3 downgrade process for most of cache_misc
[20:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:06:19] matt_flaschen: yes
[20:07:15] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail
[20:09:44] !log mattflaschen@tin Synchronized wmf-config: Disable cross-wiki notifications entirely on non-SUL wikis and hide preference (duration: 00m 34s)
[20:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:10:26] (PS1) BBlack: Revert "cache_misc: do not deliver expired cached objects" [puppet] - https://gerrit.wikimedia.org/r/288656 (https://phabricator.wikimedia.org/T134989)
[20:10:39] (CR) BBlack: [C: 2 V: 2] Revert "cache_misc: do not deliver expired cached objects" [puppet] - https://gerrit.wikimedia.org/r/288656 (https://phabricator.wikimedia.org/T134989) (owner: BBlack)
[20:11:56] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp1051 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.006 second response time
[20:12:15] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp1051 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.002 second response time
[20:12:25] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[20:13:16] RECOVERY - Varnishkafka log producer on cp1051 is OK: PROCS OK: 1 process with command name varnishkafka
[20:16:55] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:22:44] Operations, MediaWiki-Releasing, Parsoid, Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2293782 (hashar) on bromine, both files look exactly the same with same weird uid...
[20:22:58] Operations, MediaWiki-Releasing, Parsoid, Release-Engineering-Team: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238#2293784 (hashar)
[20:28:25] (PS1) RobH: Revert "Remove aqs1006 partman configuration to test why boot is failing after os install." [puppet] - https://gerrit.wikimedia.org/r/288690
[20:28:33] (PS2) RobH: Revert "Remove aqs1006 partman configuration to test why boot is failing after os install." [puppet] - https://gerrit.wikimedia.org/r/288690
[20:30:03] (CR) RobH: [C: 2] Revert "Remove aqs1006 partman configuration to test why boot is failing after os install." [puppet] - https://gerrit.wikimedia.org/r/288690 (owner: RobH)
[20:32:23] (PS1) Gehel: WIP experiments, just keeping that safe somewhere... [puppet] - https://gerrit.wikimedia.org/r/288691
[20:33:16] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2293836 (Smalyshev) Open>Resolved
[20:33:46] (CR) jenkins-bot: [V: -1] WIP experiments, just keeping that safe somewhere... [puppet] - https://gerrit.wikimedia.org/r/288691 (owner: Gehel)
[20:35:25] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:36:04] (PS1) Gehel: Revert "Depooling wdqs1001 for reinstall and new disks" [puppet] - https://gerrit.wikimedia.org/r/288693
[20:38:21] (CR) Gehel: [C: 2] Revert "Depooling wdqs1001 for reinstall and new disks" [puppet] - https://gerrit.wikimedia.org/r/288693 (owner: Gehel)
[20:39:10] gehel: your wdqs1001 depool won't take immediate effect everywhere and can't...
[20:39:18] s/depool/un-depool/
[20:39:32] as cache_misc has a bunch of nodes with puppet disabled in the midst of downgrading varnish
[20:39:35] RECOVERY - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is OK: TCP OK - 0.037 second response time on port 9042
[20:39:39] yep, puppet disabled...
[20:39:52] please don't re-enable, or run puppet anywhere
[20:40:42] bblack: change is already merged. Is that an issue for you? Should I revert? (Sorry, forgot about puppet being disabled)
[20:41:09] bblack: there is no emergency to re-pool wdqs1001
[20:41:43] no need to revert, just don't touch the cache_misc nodes please :)
[20:42:18] * gehel is not doing anything. At last a task where he is good!
[20:42:31] :)
[20:43:13] SMalyshev: ^ config changed is merged, but not activated. bblack is still working on fixing the world (or at least our varnish servers)
[20:43:25] gehel: cool, thanks
[20:44:25] just 5 more to go :)
[20:45:36] oh actually, since I'm already past eqiad, it probably is taking effect there...
[20:46:08] but there are 5 more nodes still puppet-disabled for downgrade, and two others that will stay puppet-disabled for good, and can't puppet those
[20:48:15] Operations, Traffic, Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2293942 (BBlack) Instructions for downgrading nodes to varnish3 (trialed on cache_misc): 1. disable puppet on affected nodes: 2. Update hieradata to remove varnish_version4 on...
[20:50:45] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 2 failures
[20:51:05] bblack: don't worry, WDQS is running just fine on a single server... It will be re-enabled when ...
[20:52:45] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[20:52:47] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2293963 (BBlack) I forgot one of our temporary hacks in the list above in T134989#2290254: 4. https://gerrit.wikimedia.org/r...
[20:53:21] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2293964 (RobH) Ok, So the boot from C is due to Jessie/Trusty detecting the second controller/port over the primary controller/port. The boot order has to be chang...
[20:55:13] (PS1) Catrope: Set $wgEchoCrossWikiNotifications to true on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/288701 (https://phabricator.wikimedia.org/T135266)
[20:55:15] (PS1) Catrope: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266)
[20:58:34] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2293976 (RobH) I fixed the bios boot order on aqs100[456], setting port #2 to primary allows the bios to boot in the order that the jessie/trusty installer detects...
[21:00:56] (CR) Mattflaschen: [C: 2] Set $wgEchoCrossWikiNotifications to true on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/288701 (https://phabricator.wikimedia.org/T135266) (owner: Catrope)
[21:01:35] (Merged) jenkins-bot: Set $wgEchoCrossWikiNotifications to true on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/288701 (https://phabricator.wikimedia.org/T135266) (owner: Catrope)
[21:06:52] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Enable $wgEchoCrossWikiNotifications on the right wikis (unused for now) (duration: 00m 28s)
[21:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:07:24] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures
[21:09:16] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[21:11:33] (PS1) JanZerebecki: admin: add gehel to wdqs-admin [puppet] - https://gerrit.wikimedia.org/r/288706
[21:11:50] (PS36) 20after4: scap::target keyholder-managed ssh keys [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[21:12:58] Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2294011 (BBlack)
[21:13:00] Operations, Traffic: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2294012 (BBlack)
[21:13:03] Operations, Traffic, Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2294013 (BBlack)
[21:13:06] Operations, Traffic, Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294007 (BBlack) Resolved>Open T134989 couldn't be resolved in a reasonable timeframe, and is corrupting some responses (zero body content length). I've reverted the bulk of c...
[21:13:10] Operations, Traffic, Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294014 (BBlack)
[21:13:12] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (BBlack)
[21:16:33] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294023 (BBlack) Current State: * cp3007 and cp1045 are depooled from user traffic, icinga-downtimed for several days, and...
[21:17:53] !log cache_misc varnish3 downgrade complete (except 3007 + 1045 - those remain depooled/downtimed/puppet-disabled/etc, do not touch)
[21:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:26:03] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294057 (TerraCodes)
[21:35:59] (PS1) Dzahn: add Letsencrypt cert for (www.)toolserver.org [puppet] - https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798)
[21:36:44] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail
[21:36:54] (CR) Yuvipanda: "I would suggest going with #2." [puppet] - https://gerrit.wikimedia.org/r/287570 (owner: Ladsgroup)
[21:37:22] (PS2) Dzahn: add Letsencrypt cert/config for (www.)toolserver.org [puppet] - https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798)
[21:37:31] mutante: <3
[21:38:34] Blocked-on-Operations, Operations, DBA, Labs, Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2294108 (Dzahn) @volans could you take a look maybe?
[21:38:50] mutante: how do you suppose we test it? Just backup old configs, run puppet, and revert manually if it doesn't work?
[21:39:13] YuviPanda: pretty much that, yes
[21:39:30] mutante: do you want me to add you to that project?
[21:39:31] we could prepare a bash scripts that reverts it
[21:39:38] yes, ok
[21:39:46] doing now
[21:39:48] and let's get a +1 from brandon
[21:39:54] that is only the second time it's used after rt
[21:39:58] which was the test
[21:40:13] and more than one subject name
[21:40:23] (CR) Krinkle: Fix exceptionmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/249905 (owner: MaxSem)
[21:41:31] (PS2) Rush: Glance: Fix the glance image backup cron [puppet] - https://gerrit.wikimedia.org/r/288621 (owner: Andrew Bogott)
[21:41:37] (CR) Gehel: "Since I'm already in Ops, I probably don't need specific WDQS access. I am in a few other groups, from the time when I was not a member of" [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)
[21:41:56] (CR) Rush: [C: 1] "makes sense" [puppet] - https://gerrit.wikimedia.org/r/288621 (owner: Andrew Bogott)
[21:42:47] mutante: I've added you now :) it's relic.toolserver-legacy.eqiad.wmflabs I think
[21:43:00] (CR) Rush: [C: -1] "yup thanks gehel. we have historically discouraged overlay groups that add no real extra perms to ops folks as bad practice." [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)
[21:43:12] Operations, Labs, Tool-Labs, Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294113 (yuvipanda) a:yuvipanda>Dzahn nope, Dzahn is awesomer :)
[21:43:29] YuviPanda: alright
[21:43:32] Operations, Beta-Cluster-Infrastructure, Deployment-Systems, Patch-For-Review, Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2294115 (mmodell) >>! In T133211#2275973, @faidon wrote: > I think having a flat service deplo...
[21:52:25] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/: Fixes for cross-wiki notifications deployment fallout (duration: 00m 38s)
[21:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:54:39] (PS3) Yuvipanda: Add 'hostautomounter' admission controller [software/kubernetes] - https://gerrit.wikimedia.org/r/288600
[21:55:05] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:55:51] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/Echo.php: Bump cache version now that cache pollution is hopefully fixed (duration: 00m 25s)
[21:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:59:20] anyone know an easy way to dump a table of data from a logstash panel to json?
[21:59:46] intercept the ajax call
[21:59:53] :)
[22:00:41] or figure out how to use https://github.com/bd808/ggml to make the query
[22:01:09] I've done some data mining in the past with that over an ssh tunnel to one of the logstash boxes
[22:01:28] (PS1) Hashar: admin: utility for a group/user matrix [puppet] - https://gerrit.wikimedia.org/r/288711 (https://phabricator.wikimedia.org/T135187)
[22:01:41] there's a curl dump under inspect in the UI that kinda works
[22:01:50] but limits the output to only a few data rows, hmmmmm
[22:02:10] here's a ggml usage example -- https://phabricator.wikimedia.org/P2544
[22:02:32] (CR) jenkins-bot: [V: -1] admin: utility for a group/user matrix [puppet] - https://gerrit.wikimedia.org/r/288711 (https://phabricator.wikimedia.org/T135187) (owner: Hashar)
[22:03:39] I'd run it for you but I have to leave to run some errands
[22:04:45] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:04:53] I found a way :)
[22:05:14] it's ugly, but I can use chrome's dev console to inspect the html elements after the data is loaded, and have it copy the table to my paste buffer :)
[22:06:45] (PS2) Hashar: admin: utility for a group/user matrix [puppet] - https://gerrit.wikimedia.org/r/288711 (https://phabricator.wikimedia.org/T135187)
[22:10:17] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2294222 (Smalyshev)
[22:10:19] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#2294223 (Smalyshev)
[22:10:21] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2294221 (Smalyshev) Open>Resolved
[22:22:30] (CR) Smalyshev: "one thing is _may_ have to do with is notifications. Not sure if it's the same groups or not, but wdqs-admins is used for some alerts, and" [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)
[22:23:10] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2294261 (Smalyshev)
[22:23:12] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#2294259 (Smalyshev) Open>Resolved a:Smalyshev
[22:26:25] Operations, ops-codfw, Discovery, Wikidata, Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#2294295 (Smalyshev)
[22:28:05] (CR) Rush: "Nah, they shouldn't be tied at all." [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)
[22:34:29] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/: Log failures to fetch cross-wiki notifications (duration: 00m 41s)
[22:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:38:04] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Flow/includes/Notifications/Formatter.php: Fix fatal for old Flow notifications (duration: 00m 26s)
[22:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:40:43] (CR) Yuvipanda: [C: 2 V: 2] Add toollabs base container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/288464 (https://phabricator.wikimedia.org/T134748) (owner: Yuvipanda)
[22:46:00] Operations, Traffic, HTTPS, MW-1.27-release-notes, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2294355 (BBlack) Announcement email (finally) sent! The cutoff dates/process are: * 2016-06-12 - We'll randomly reject 10% of insecure POST with "403 - Inse...
[22:50:08] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/includes/api/ApiEchoNotifications.php: More logging (duration: 00m 25s)
[22:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:54:51] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/includes/api/ApiEchoNotifications.php: Logging live hack (duration: 00m 31s)
[22:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:55:26] (PS1) Dzahn: cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112)
[22:57:39] (PS2) Dzahn: cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112)
[22:59:04] (PS3) Dzahn: cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112)
[22:59:07] andre__afk: when you're back, can you take a look at https://phabricator.wikimedia.org/T135290?
[23:01:27] (PS1) Yuvipanda: tools: Bump k8s version [puppet] - https://gerrit.wikimedia.org/r/288719
[23:04:07] (PS1) Dzahn: exim: route mail for RT to ununpentium [puppet] - https://gerrit.wikimedia.org/r/288721
[23:04:21] (CR) BBlack: [C: 1] cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112) (owner: Dzahn)
[23:04:29] (PS2) Dzahn: exim: route mail for RT to ununpentium [puppet] - https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112)
[23:04:44] (CR) Dzahn: [C: -2] "not yet, just preparing" [puppet] - https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) (owner: Dzahn)
[23:05:23] (CR) Dzahn: [C: 1] cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112) (owner: Dzahn)
[23:06:11] RoanKattouw, are you guys aware of Invalid parameter for message "notification-body-edit-user-talk-with-section": a:1:{s:9:"plaintext";N;} in /srv/mediawiki/php-1.28.0-wmf.1/includes/Message.php
[23:06:11] on line 1103 ?
[23:06:11] (CR) BBlack: [C: -1] "The subjects lists needs to be comma-separated inside the string, rather than whitespace." [puppet] - https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798) (owner: Dzahn)
[23:07:47] MaxSem: Yes, I'm aware, haven't had a chance to investigate things yet because worse things are broken
[23:08:01] :O
[23:08:49] (PS3) Dzahn: add Letsencrypt cert/config for (www.)toolserver.org [puppet] - https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798)
[23:10:44] (CR) Dzahn: [C: 2] cache::misc: add director for rt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/288718 (https://phabricator.wikimedia.org/T119112) (owner: Dzahn)
[23:29:34] (PS1) Dzahn: Revert "ununpentium: temp remove the RT role" [puppet] - https://gerrit.wikimedia.org/r/288727
[23:30:15] (PS2) Dzahn: Revert "ununpentium: temp remove the RT role" [puppet] - https://gerrit.wikimedia.org/r/288727
[23:32:53] Operations, Labs, Tool-Labs, Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294401 (Dzahn) which instance/host is the one that actually runs it?
[23:34:35] (CR) Dzahn: [C: 2] Revert "ununpentium: temp remove the RT role" [puppet] - https://gerrit.wikimedia.org/r/288727 (owner: Dzahn)
[23:36:15] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Puppet last ran 21 hours ago
[23:36:37] na, it just finished
[23:38:16] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:38:43] MaxSem: MaxSem:
[23:38:51] MaxSem: https://gerrit.wikimedia.org/r/288728
[23:47:35] (CR) BBlack: [C: 1] add Letsencrypt cert/config for (www.)toolserver.org [puppet] - https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798) (owner: Dzahn)
[23:54:16] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/includes/api/ApiEchoNotifications.php: Do not reuse CentralAuth tokens (duration: 00m 25s)
[23:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:57:17] (Abandoned) JanZerebecki: admin: add gehel to wdqs-admin [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)
[23:59:17] (CR) Dzahn: "notifications are unaffacted by admin groups. they are separate groups just defined in icinga/nagios. it just happens to have that same na" [puppet] - https://gerrit.wikimedia.org/r/288706 (owner: JanZerebecki)