[00:01:42] <wikibugs>	 Operations, Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (BBlack)
[00:42:20] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/AbuseFilter/includes/Views/AbuseFilterViewExamine.php: fix comment stuff (duration: 00m 46s)
[00:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:44] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590296 (demon) >>! In T175288#3589585, @RobH wrote: > I'm going to install with jessie, like tin and naos both presently have...
[00:50:46] <wikibugs>	 (CR) Chad: [C: 1] "This can land whenever." [puppet] - https://gerrit.wikimedia.org/r/350484 (owner: Paladox)
[00:57:18] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590306 (bd808) Moving the deploy server to stretch should probably wait for a stretch HHVM build shouldn't it?
[02:03:27] <icinga-wm>	 RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational
[02:06:57] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycache: try to repair it on query. Default database: dewiki. [Query snipped]
[02:12:48] <wikibugs>	 (PS5) GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264)
[03:34:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.59 seconds
[03:40:03] <wikibugs>	 Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590434 (demon) Probably a decent idea. Ignore my idea.
[03:46:27] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycache: try to repair it on query. Default database: dewiki. [Query snipped]
[03:56:21] <yannf>	 https://phabricator.wikimedia.org/T175304 <-  this is a serious issue
[04:09:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 131.62 seconds
[04:43:38] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[04:45:37] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.007 second response time
[04:53:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89790.55 seconds
[05:06:11] <Esther>	 yannf: Have you found the culprit?
[05:16:27] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational
[05:37:38] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[05:38:38] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[05:57:47] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:57:57] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:57:57] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:57:57] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:57:58] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:57:58] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:57:58] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:07] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:08] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:08] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:17] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:17] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:27] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:28] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:37] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:37] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:37] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:37] <icinga-wm>	 PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:58:37] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:58:47] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:59:27] <icinga-wm>	 RECOVERY - jmxtrans on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[06:01:17] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:17] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:17] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:17] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[06:01:37] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:37] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:37] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:38] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:38] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:38] <icinga-wm>	 RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:47] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:47] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:57] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:01:58] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[06:01:58] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:01:58] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:02:07] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:02:07] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[06:02:07] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:03:23] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590510 (elukey) ``` elukey@kafka-jumbo1001:/usr/share/jmxtrans$ source /etc/default/jmxtrans elukey@kafka-jumbo1001:/usr/share/jmx...
[06:03:35] <elukey>	 gehel: ---^ (just as FYI)
[06:32:36] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "She's already in ldap_only_users, so that needs to be removed when getting shell access in addition." [puppet] - https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: Dzahn)
[06:34:40] <wikibugs>	 Operations, Ops-Access-Requests: Change  prod uid  from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3590515 (MoritzMuehlenhoff) a:RobH Assigning to Rob, who created both the LDAP and shell user access. This also triggers a warning in the daily account...
[06:35:50] <wikibugs>	 (PS1) Elukey: jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992)
[06:39:45] <wikibugs>	 (CR) Reedy: "We could use symlinks instead... But if this is going to be shortlived...." [mediawiki-config] - https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: Dzahn)
[06:41:33] <wikibugs>	 (CR) DCausse: [V: 2 C: 2] adding Priority: optional to metadata [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/376568 (owner: Gehel)
[06:55:06] <wikibugs>	 (PS1) Ema: prometheus: add aggregation rules for IPVS [puppet] - https://gerrit.wikimedia.org/r/376665
[06:56:42] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Also print amount of hosts not requiring a restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376203 (owner: Muehlenhoff)
[06:58:04] <wikibugs>	 (PS2) Elukey: jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992)
[06:58:30] <wikibugs>	 (PS2) Muehlenhoff: Remove restbase salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376250
[07:01:55] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove restbase salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376250 (owner: Muehlenhoff)
[07:02:48] <wikibugs>	 (PS2) Muehlenhoff: Remove sca/scb salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376272
[07:03:59] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove sca/scb salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376272 (owner: Muehlenhoff)
[07:04:58] <wikibugs>	 (PS2) Muehlenhoff: Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376277
[07:06:24] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376277 (owner: Muehlenhoff)
[07:10:49] <wikibugs>	 (PS2) Muehlenhoff: Remove lvs salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376213
[07:11:27] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove lvs salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376213 (owner: Muehlenhoff)
[07:11:33] <wikibugs>	 (PS2) Ema: maps: active/active public interface [puppet] - https://gerrit.wikimedia.org/r/345591 (owner: BBlack)
[07:13:13] <wikibugs>	 Operations, Commons, Thumbor, media-storage, Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3590555 (Gilles)
[07:13:26] <wikibugs>	 (PS2) Muehlenhoff: Remove cache salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376212
[07:13:44] <wikibugs>	 (CR) Volans: [C: 2] "Trivial fix, self-merging" [software/cumin] - https://gerrit.wikimedia.org/r/374132 (owner: Volans)
[07:15:01] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove cache salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376212 (owner: Muehlenhoff)
[07:15:36] <wikibugs>	 (Merged) jenkins-bot: CLI: fix --version option [software/cumin] - https://gerrit.wikimedia.org/r/374132 (owner: Volans)
[07:16:36] <wikibugs>	 (CR) Volans: [C: 2] "Trivial fix, self-merging" [software/cumin] - https://gerrit.wikimedia.org/r/374133 (https://phabricator.wikimedia.org/T174008) (owner: Volans)
[07:18:17] <wikibugs>	 (Merged) jenkins-bot: Fix data_files installation directory [software/cumin] - https://gerrit.wikimedia.org/r/374133 (https://phabricator.wikimedia.org/T174008) (owner: Volans)
[07:18:32] <wikibugs>	 (PS2) Volans: Transports: better handling of empty list [software/cumin] - https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911)
[07:18:46] <wikibugs>	 (PS2) Muehlenhoff: Remove k8s salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376265
[07:19:50] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove k8s salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376265 (owner: Muehlenhoff)
[07:21:10] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@e6aeeeb] (dev-cluster): (no justification provided)
[07:21:22] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@e6aeeeb] (dev-cluster): (no justification provided) (duration: 00m 11s)
[07:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:04] <wikibugs>	 (PS2) Muehlenhoff: Remove ganeti salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376247
[07:22:45] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove ganeti salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376247 (owner: Muehlenhoff)
[07:23:44] <wikibugs>	 (CR) Volans: [C: 2] Transports: better handling of empty list [software/cumin] - https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) (owner: Volans)
[07:24:03] <wikibugs>	 (PS2) Muehlenhoff: Memove releng-related salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376244
[07:24:11] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590584 (elukey) Applied the following to all the nodes to remove the placeholder logical volumes:  ``` root@kafka-jumbo1001:/home/...
[07:25:10] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): (no justification provided)
[07:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:30] <wikibugs>	 (Merged) jenkins-bot: Transports: better handling of empty list [software/cumin] - https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) (owner: Volans)
[07:25:47] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[07:26:03] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Memove releng-related salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376244 (owner: Muehlenhoff)
[07:32:11] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@0d39acf] (dev-cluster): (no justification provided) (duration: 07m 02s)
[07:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:53] <wikibugs>	 Operations, Analytics, netops, User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3590602 (elukey) The next step is to design and add the `analytics-in6` filter to cr1/cr2 eqiad, but I would wait for kafka1012-1022 to be decommissioned before that. Those h...
[07:46:37] <icinga-wm>	 RECOVERY - Restbase root url on restbase-dev1004 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.034 second response time
[07:48:59] <wikibugs>	 (Abandoned) Volans: wmf-auto-reimage: support mutiple conftool roles [puppet] - https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) (owner: Volans)
[07:51:50] <Amir1>	 https://phabricator.wikimedia.org/T174269
[07:53:20] <moritzm>	 !log installing ruby2.3 security updates
[07:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:38] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational
[08:07:58] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[08:11:10] <mobrovac>	 !log restbase enabled back puppet in the dev cluster - T169940
[08:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:23] <stashbot>	 T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940
[08:11:55] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster - T169940
[08:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:06] <moritzm>	 !log installing libgd2 security updates
[08:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "The rule names are correct, syntax is missing ()" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/376665 (owner: Ema)
[08:20:15] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590648 (elukey) I could be wrong but from cr1/cr2 eqiad the hosts seem to be in the Analytics VLAN, and they shouldn't be:  ``` el...
[08:21:30] <wikibugs>	 Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590649 (jcrespo) > Already answered above. CRIT on Icinga doesn't mean "can cause an outage". But it's also easy to change to WARN.  Thank you; I was in some cases ignoring W...
[08:23:43] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590651 (elukey) @Ottomata: let's also remember to whitelist the jumbo IPs in the Analytics VLAN firewall rules, otherwise hosts li...
[08:26:19] <wikibugs>	 Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590653 (jcrespo) cc @Faidon ^
[08:42:18] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:42:20] <wikibugs>	 Operations, Goal, Kubernetes, Services (watching), User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3590669 (fgiunchedi) >>! In T170120#3587521, @mobrovac wrote: > On the metrics side, we standardised on the StatsD format, but +1 on using Prometheus...
[08:43:45] <wikibugs>	 (PS2) Muehlenhoff: Remove parsoid salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376249
[08:44:46] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove parsoid salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376249 (owner: Muehlenhoff)
[08:51:36] <wikibugs>	 (PS2) Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376245
[08:52:30] <wikibugs>	 (PS3) Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376245
[08:53:14] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove WMCS-related salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376245 (owner: Muehlenhoff)
[08:54:17] <wikibugs>	 (PS2) Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - https://gerrit.wikimedia.org/r/375980
[08:55:03] <wikibugs>	 Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590701 (jcrespo) > Were security considerations taken into account?  Security considerations with pinging on a public channel which servers security patches (in the form of m...
[08:55:09] <wikibugs>	 Operations, Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (fgiunchedi) Those messages are due to acpi power meter, which we blacklist as of https://gerrit.wikimedia.org/r/#/c/356422/. A reboot should make the message go away.
[08:56:32] <wikibugs>	 (PS2) Muehlenhoff: Remove NFS salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376274
[08:57:09] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove NFS salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376274 (owner: Muehlenhoff)
[09:00:59] <wikibugs>	 (PS1) Muehlenhoff: Remove labtest salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376672
[09:01:38] <moritzm>	 !log installing remaining gnutls updates from jessie point release
[09:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:00] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster, take #2 - T169940
[09:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:14] <stashbot>	 T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940
[09:08:01] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster, take #2 - T169940 (duration: 01m 01s)
[09:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:17] <icinga-wm>	 RECOVERY - Restbase root url on restbase-dev1005 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.014 second response time
[09:09:07] <icinga-wm>	 RECOVERY - Restbase root url on restbase-dev1006 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.017 second response time
[09:09:15] <moritzm>	 !log installing libonig security updates
[09:09:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:51] <wikibugs>	 (CR) Gehel: "Wow! That's a lot of GC tuning! The current upstream has gone back to a much saner default (https://github.com/jmxtrans/jmxtrans/blob/mast" [debs/jmxtrans] - https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: Elukey)
[09:15:57] <wikibugs>	 (CR) Gehel: [C: 1] "Another note, the default heap / perm size (512/384) are probably far too large for our use case (but I have not looked at any data, just " [debs/jmxtrans] - https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: Elukey)
[09:16:37] <wikibugs>	 Operations, monitoring, netops, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3590739 (fgiunchedi) FTR: username is root and password is the management password.  I checked the webui for ps1-a2-eqiad and the current/voltage readings are...
[09:16:56] <wikibugs>	 (PS1) Gilles: Upgrade to 1.3 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/376674 (https://phabricator.wikimedia.org/T173580)
[09:20:25] <wikibugs>	 (PS1) Gilles: Thumbor: enable new MAX_ANIMATED_GIF_AREA option [puppet] - https://gerrit.wikimedia.org/r/376676 (https://phabricator.wikimedia.org/T173580)
[09:24:34] <wikibugs>	 (CR) Elukey: [C: 2] jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: Elukey)
[09:25:08] <moritzm>	 !log restarting apache on tegmen/einsteinium to pick up security update
[09:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:20] <wikibugs>	 Operations, DC-Ops: Review and fix PDU settings for syslog/ntp servers - https://phabricator.wikimedia.org/T175341#3590787 (fgiunchedi)
[09:43:44] <wikibugs>	 Operations, DC-Ops: Review and fix PDU settings for syslog/ntp/email servers - https://phabricator.wikimedia.org/T175341#3590843 (fgiunchedi)
[09:44:49] <wikibugs>	 Operations, MediaWiki-Platform-Team, Performance-Team, Epic, Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3590846 (Gilles)
[09:46:41] <wikibugs>	 Operations, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3590850 (Joe) Running the mdadm command on one host caused a re-election to happen. It seems likely we found the culprit, so now I'm going to run the command at the same time on first two hos...
[09:48:00] <_joe_>	 !log running `/usr/share/mdadm/checkarray --cron --all --idle --quiet` on conf2001 and conf2003 trying to reproduce consensus issues in T162013
[09:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:13] <stashbot>	 T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013
[09:58:14] <_joe_>	 !log running the same command on conf2002 too
[09:58:18] <volans>	 ack
[09:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:46] <_joe_>	 volans: this time I'm not seeing leader loss or anything, uhm
[09:59:18] <_joe_>	 so let's see with all three active
[09:59:28] <volans>	 which md drive is it checking now?
[09:59:40] <_joe_>	 it already got to md2
[09:59:47] <_joe_>	 on conf2001 and 2003
[10:00:06] <yannf>	 Esther, not fixed yet
[10:00:11] <yannf>	 https://phabricator.wikimedia.org/T175304 <-  this is a serious issue
[10:00:13] <_joe_>	 and we have 15 to 10% iowait there
[10:00:29] <volans>	 let's see
[10:01:17] <icinga-wm>	 RECOVERY - Check systemd state on restbase2003 is OK: OK - running: The system is fully operational
[10:06:17] <wikibugs>	 (PS1) Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688
[10:07:44] <volans>	 !log testing wmf-auto-reimage on mc1001 T166300 T164341
[10:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:57] <stashbot>	 T166300: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300
[10:07:58] <stashbot>	 T164341: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341
[10:08:48] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove labtest salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376672 (owner: Muehlenhoff)
[10:09:29] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590908 (elukey) @Ottomata: I merged https://gerrit.wikimedia.org/r/#/c/376663 but I then realized that master/debian branches are...
[10:09:38] <_joe_>	 volans: here we are
[10:09:42] <_joe_>	 consensus lost
[10:09:53] <_joe_>	 !log etcd cluster lost consensus T162013
[10:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:06] <stashbot>	 T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013
[10:10:14] <_joe_>	 this will page people, sorry for that
[10:10:54] <volans>	 _joe_: disabled active checks, so no page ;)
[10:11:14] <_joe_>	 volans: oh ok, I would've loved to see after how much time we would notice this
[10:12:18] <volans>	 3 checks
[10:12:29] <_joe_>	 !log stopped all md devices syncs on conf200* machines
[10:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:49] <_joe_>	 and as soon as I did that, consensus is back
[10:13:35] <_joe_>	 ok, so we can just stagger the resyncs across different dates
[10:13:44] <_joe_>	 or reduce the speed of the resync via sysctl
[10:13:53] <_joe_>	 actually I think I'll do that
[10:14:12] <volans>	 md checks should be ideally scattered inside a cluster
[10:14:14] <volans>	 like puppet runs
[10:14:33] <_joe_>	 volans: still, do you agree we might even try to just reduce the resync speed?
[10:14:38] <volans>	 sure
[10:15:04] <_joe_>	 I'll post these observations to the ticket. Again, great catch
[10:15:30] <volans>	 I was thinking of using the last 2 digits of the hostname (for names like nameXXXX) modulo 28 as the day of the month to do that :-P
[10:15:40] <volans>	 _joe_: re-enabling active checks
[10:15:42] <_joe_>	 (I restarted replication btw)
[10:15:46] <_joe_>	 yeah please do
[10:15:57] <_joe_>	 I'm going afk for a bit, but everything is ok no
[10:15:59] <_joe_>	 *now
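
A minimal sketch of the staggering idea floated above: take the last two digits of a hostname shaped like nameXXXX, reduce them modulo 28, and use the result as the day of the month for that host's md check. The helper name `md_check_day` and the `+ 1` shift (so the result falls on days 1 to 28 instead of 0 to 27) are assumptions made for illustration; nothing in the log says this was implemented.

```python
import re


def md_check_day(hostname: str, days: int = 28) -> int:
    """Return a day of the month (1..days) on which to run a host's md check.

    Hypothetical helper: uses the last two digits of the short hostname,
    modulo `days`, plus 1 so the result is never day 0 (the +1 is an
    assumption, not something stated in the discussion).
    """
    short_name = hostname.split(".")[0]  # drop any domain suffix
    match = re.search(r"(\d{2})$", short_name)
    if match is None:
        raise ValueError(f"{hostname!r} does not end in two digits")
    return int(match.group(1)) % days + 1


if __name__ == "__main__":
    # The three etcd hosts from the experiment land on three different days.
    for host in ("conf2001", "conf2002", "conf2003"):
        print(host, md_check_day(host))
```

With this mapping the three conf200* hosts get consecutive, distinct days, so their checks would no longer overlap. Hosts whose last two digits differ by exactly 28 would still share a day, which only matters for clusters numbered that far apart.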
[10:16:41] <p858snake>	 <yannf> https://phabricator.wikimedia.org/T175304 <-  this is a serious issue  < Reedy / legoktm / no_justification willing to do a weekend deploy?
[10:18:01] <yannf>	 this prevents most work on all Wikisources
[10:18:10] <hashar>	 grgmbmbm
[10:18:31] <hashar>	 yannf: p858snake: do you have a way to reproduce?   I don't mind deploying it now
[10:19:36] <hashar>	 and if you have the X-Wikimedia-Debug browser extension https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions   you would even be able to test it  :]
[10:20:03] <hashar>	 also apparently the patch got merged on beta, so the issue should no longer show up on https://en.wikisource.beta.wmflabs.org/wiki/Main_Page
[10:20:55] <wikibugs>	 (PS1) Muehlenhoff: Remove memcached/redis salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376689
[10:21:24] <yannf>	 hashar, you need a page in the Page: namespace
[10:21:44] <yannf>	 i.e. so probably also an index
[10:23:42] <yannf>	 hashar, the buttons should appear here at the bottom https://fr.wikisource.org/w/index.php?title=Page:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol12.djvu/1&action=edit&redlink=1
[10:23:58] <yannf>	 there is none right now
[10:24:07] <hashar>	 ok
[10:24:52] <hashar>	 +2'ed
[10:25:02] <hashar>	 and will deploy on the debug servers once the patch has merged
[10:25:14] <hashar>	 then try the above link using the X-Wikimedia-Debug browser extension
[10:26:16] <yannf>	 https://en.wikisource.beta.wmflabs.org/wiki/Page:Dictionary_of_National_Biography_volume_51.djvu/397 here
[10:26:33] <wikibugs>	 (03CR) 10Elukey: [C: 031] Remove memcached/redis salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376689 (owner: 10Muehlenhoff)
[10:27:13] <yannf>	 not sure why, the index doesn't show up properly on Beta
[10:28:43] <Amir1>	 I can deploy 
[10:29:30] <Amir1>	 hashar: you are already on it, awesome :)
[10:30:02] <hashar>	 ok it got merged
[10:31:43] <hashar>	 p858snake: yannf: Amir1: deployed on mwdebug1001
[10:33:48] <Amir1>	 I can't see it
[10:33:54] <Amir1>	 tried several times
[10:34:18] <hashar>	 I can 
[10:34:24] <hashar>	 on mwdebug1001 the buttons show up
[10:35:02] <logmsgbot>	 !log hashar@tin Synchronized php-1.30.0-wmf.17/extensions/ProofreadPage: Restore page status buttons - T175304 (duration: 00m 50s)
[10:35:09] <hashar>	 yannf: try again ? :]
[10:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:15] <stashbot>	 T175304: Page status buttons do not appear any more in Wikisource - https://phabricator.wikimedia.org/T175304
[10:35:20] <hashar>	 Amir1: I used https://fr.wikisource.org/w/index.php?title=Page:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol12.djvu/1&action=edit&redlink=1
[10:35:37] <Amir1>	 I can see it now
[10:36:55] <hashar>	 I marked it as fixed up
[10:39:01] <yannf>	 yeah, good now
[10:39:05] <yannf>	 thanks a lot!
[10:47:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove memcached/redis salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376689 (owner: 10Muehlenhoff)
[10:48:58] <wikibugs>	 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591019 (10Joe) Result of the latest experiment:  - Consensus was lost as soon as resync reached md2 on all servers - `iowait` rose on all servers above 10%, see https://grafana.wikimedia.org/d...
[10:59:57] <wikibugs>	 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3591039 (10Joe) > I am a bit more concerned about performance and reliability implications of adding indirections in the data path itself. TLS is suppo...
[11:01:44] <_joe_>	 win 27
[11:01:54] <_joe_>	 even with a slash in front
[11:06:16] <wikibugs>	 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3591084 (10phuedx)
[11:07:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove remaining salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376702
[11:12:33] <icinga-wm>	 PROBLEM - Host mc1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:14:16] <moritzm>	 mc1001 is the old server batch, might be decom in progress
[11:15:23] <icinga-wm>	 RECOVERY - Host mc1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:16:58] <volans>	 that's me
[11:17:01] <volans>	 see SAL
[11:17:05] <volans>	 testing the reimage script
[11:17:17] * volans wonders how it can be in icinga though
[11:25:03] <icinga-wm>	 PROBLEM - Host mc1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:28:23] <icinga-wm>	 RECOVERY - Host mc1001 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[11:30:55] <wikibugs>	 (03CR) 10Volans: "See comment inline" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff)
[11:53:31] <wikibugs>	 (03PS1) 10Mobrovac: Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281)
[11:56:24] <icinga-wm>	 PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:34] <icinga-wm>	 PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:34] <icinga-wm>	 PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:35] <icinga-wm>	 PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:45] <icinga-wm>	 PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:54] <icinga-wm>	 PROBLEM - puppet last run on dumpsdata1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:56:54] <icinga-wm>	 PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:04] <icinga-wm>	 PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:04] <icinga-wm>	 PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:05] <icinga-wm>	 PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:14] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:24] <icinga-wm>	 PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:24] <icinga-wm>	 PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:25] <icinga-wm>	 PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:25] <icinga-wm>	 PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:34] <icinga-wm>	 PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:35] <icinga-wm>	 PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:57:35] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:02:18] <wikibugs>	 (03CR) 10Muehlenhoff: Fix parsing of necessary restarts in query_restart (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff)
[12:02:29] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688
[12:08:03] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3591270 (10mobrovac)
[12:12:07] <volans>	 akosiaris: FYI puppetdb was restarted on nitrogen by the OOM-killer ^^^
[12:12:22] <volans>	 (killed by the OOM-killer, restarted by systemd ofc ;) )
[12:14:32] <akosiaris>	 ah nice
[12:16:28] <volans>	 seems that we still don't have a stable memory config for it :(
[12:21:56] <akosiaris>	 why is OOM showing up though
[12:22:14] <akosiaris>	 the box still had 5.33GB in cached memory
[12:22:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Please remove the system::role declaration. Otherwise, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[12:22:28] <wikibugs>	 (03CR) 10Volans: "Already nicer, I think we can save another couple of level of indentation ;)" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff)
[12:22:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline comment. Otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[12:24:34] <icinga-wm>	 RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:24:44] <icinga-wm>	 RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:24:44] <icinga-wm>	 RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[12:24:54] <icinga-wm>	 RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[12:24:54] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[12:25:04] <icinga-wm>	 RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[12:25:06] <icinga-wm>	 RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[12:25:14] <icinga-wm>	 RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[12:25:14] <icinga-wm>	 RECOVERY - puppet last run on dumpsdata1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[12:25:24] <icinga-wm>	 RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[12:25:24] <icinga-wm>	 RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[12:25:25] <icinga-wm>	 RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:25:34] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[12:25:44] <icinga-wm>	 RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:25:45] <icinga-wm>	 RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:25:54] <icinga-wm>	 RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[12:25:54] <icinga-wm>	 RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[12:25:54] <icinga-wm>	 RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[12:29:47] <wikibugs>	 (03PS2) 10Mobrovac: Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281)
[12:30:09] <wikibugs>	 (03CR) 10Mobrovac: Add the cp-jobqueue profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[12:34:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think we probably want it to become CRIT at some point." [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[12:35:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[12:36:05] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: kubernetes: Add a few recommended admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374974 (https://phabricator.wikimedia.org/T170119)
[12:36:12] <_joe_>	 !log reduced raid resync max speed of raid devices on conf200* to 20000
[12:36:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Add a few recommended admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374974 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris)
[12:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:29] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 (https://phabricator.wikimedia.org/T170111)
[12:36:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 (https://phabricator.wikimedia.org/T170111) (owner: 10Alexandros Kosiaris)
[12:36:52] <_joe_>	 !log restarting a check of the md2 devices on conf200* T162013 after reducing sync speed
[12:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:04] <stashbot>	 T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013
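A minimal sketch of the kind of change _joe_ logs above, assuming the limit was applied through the standard Linux md tunable `/proc/sys/dev/raid/speed_limit_max` (in KiB/s). The log does not say whether this system-wide knob or a per-device setting was used, and the function name is hypothetical.

```python
from pathlib import Path

# Standard Linux software RAID tunable: upper bound on resync speed, in KiB/s.
# Assumption: this is the knob that was changed; the log does not say.
SPEED_LIMIT_MAX = Path("/proc/sys/dev/raid/speed_limit_max")


def set_resync_speed_limit(limit_kib_s: int, path: Path = SPEED_LIMIT_MAX) -> int:
    """Write a new resync speed ceiling and return the previous value."""
    if limit_kib_s <= 0:
        raise ValueError("limit must be a positive number of KiB/s")
    # Read the current value first so the caller can restore it later.
    previous = int(path.read_text().strip())
    path.write_text(f"{limit_kib_s}\n")
    return previous
```

Calling `set_resync_speed_limit(20000)` as root would match the value in the log entry. A change made this way does not survive a reboot, so a permanent fix needs configuration management.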
[12:40:08] <wikibugs>	 (03PS1) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494)
[12:40:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush)
[12:52:16] <wikibugs>	 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591347 (10Joe) Reducing the sync speed manually did the job, so we can just puppetize this.
[12:57:44] <icinga-wm>	 PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cp-jobqueue]
[12:57:55] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cp-jobqueue]
[12:59:45] <volans>	 mobrovac: ^^^
[13:00:03] <mobrovac>	 damn
[13:00:50] <wikibugs>	 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3591354 (10faidon)
[13:04:01] <mobrovac>	 volans: i can't do/see anything on tin (no perms). can you take a look at what the puppet log says there?
[13:05:15] <volans>	 mobrovac: exec of '/usr/bin/git -c core.sharedRepository=group clone --recurse-submodules https://gerrit.wikimedia.org/r/p/mediawiki/services/cp-jobqueue.git /srv/deployment/cp-jobqueue/cp-jobqueue' returned 128
[13:05:32] <wikibugs>	 10Operations, 10Mail: Split MXes into inbound and outbound - https://phabricator.wikimedia.org/T175362#3591376 (10faidon)
[13:05:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:06:19] <mobrovac>	 volans: uf ok, so puppet first creates the dir and then tries to clone there, which fails because the dir exists
[13:06:31] <mobrovac>	 scap seems to assume that a repo always has a slash in it
[13:06:33] <volans>	 lol
[13:06:34] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[13:06:44] <icinga-wm>	 PROBLEM - HHVM rendering on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[13:06:53] <volans>	 _joe_: ^^^
[13:06:54] <icinga-wm>	 RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 80330 bytes in 0.233 second response time
[13:07:07] <volans>	 or moritzm ^^^
[13:07:21] <mobrovac>	 _joe_: cp-jobqueue will not do it, we'll have to have a separate deploy repo for that; how about change-propagation/jobqueue-deploy ?
[13:07:24] <_joe_>	 looking
[13:07:32] <_joe_>	 mobrovac: later, sorry
[13:07:34] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.128 second response time
[13:07:44] <icinga-wm>	 RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 80329 bytes in 0.116 second response time
[13:09:01] <_joe_>	 volans: same issue we had other times
[13:09:14] <volans>	 ack
[13:14:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: etcd: limit RAID resync speed if on linux software raid [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013)
[13:15:42] <paravoid>	 _joe_: our abstractions are nice, aren't they
[13:15:57] <paravoid>	 a fact, a sysctl definition
[13:19:07] <_joe_>	 paravoid: indeed!
[13:19:55] <paravoid>	 if the raid resync is affecting etcd's consensus, maybe it's too sensitive?
[13:20:14] <_joe_>	 yeah but the servers got to have ~ 20% iowait
[13:20:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a new debdeploy command query_version [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715
[13:20:52] <_joe_>	 and I still have to understand why that doesn't happen in eqiad
[13:21:23] <_joe_>	 paravoid: I'm more comfortable reducing the raid sync speed than with fiddling with RAFT timeouts on a friday afternoon, too :)
[13:21:46] <paravoid>	 nod
[13:22:29] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a new debdeploy command query_version [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715
[13:28:02] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, compiler for reference available here:" [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto)
[13:28:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: limit RAID resync speed if on linux software raid [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto)
[13:29:07] <wikibugs>	 (03CR) 10Gehel: "LGTM in general. I'd like to find someone who has some time to check the state of maps codfw a bit more before merging this." [puppet] - 10https://gerrit.wikimedia.org/r/345591 (owner: 10BBlack)
[13:34:54] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[13:35:13] <_joe_>	 uhm
[13:35:40] <_joe_>	 mobrovac: it worked ^^
[13:35:46] <_joe_>	 "eventual consistency" :P
[13:39:04] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:39:53] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10procurement: cost estimate for two prometheus+grafana servers for fundraising - https://phabricator.wikimedia.org/T175364#3591415 (10Jgreen)
[13:40:22] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591441 (10Joe) 05Open>03Resolved a:03Joe
[13:51:44] <icinga-wm>	 PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:33] <wikibugs>	 (03CR) 10Muehlenhoff: Add ferm service for rpc.statd on labstore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff)
[14:00:04] <wikibugs>	 (03PS2) 10Muehlenhoff: Add ferm service for rpc.statd on labstore [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136)
[14:02:15] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (10mark) @Dzahn: thanks for working on this! The current implementation with WARNINGs and CRITICAL at higher threshold seems like a good start. If needed, we can readjus...
[14:06:35] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[14:08:01] <wikibugs>	 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591573 (10Suhadakashter)
[14:08:57] <wikibugs>	 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (10Suhadakashter)
[14:09:28] <wikibugs>	 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591614 (10Reedy) 05duplicate>03Open
[14:11:43] <wikibugs>	 (03PS2) 10Ema: prometheus: add aggregation rules for IPVS [puppet] - 10https://gerrit.wikimedia.org/r/376665
[14:12:02] <wikibugs>	 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591637 (10fgiunchedi) I checked with @mark and the current readings are per-phase, since we're using "3 Wye" phase configuration and a server will consume from...
[14:14:54] <wikibugs>	 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591653 (10fgiunchedi) The graphs for codfw and eqiad are also reported here (stacked) https://grafana.wikimedia.org/dashboard/db/site-power-usage
[14:15:13] <wikibugs>	 10Operations: Integrate stretch 9.1 point release - https://phabricator.wikimedia.org/T171453#3591654 (10MoritzMuehlenhoff) These are fully rolled out: nagios-nrpe gnutls28 perl
[14:15:39] <wikibugs>	 (03Abandoned) 10Andrew Bogott: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:16:26] <wikibugs>	 (03Restored) 10Andrew Bogott: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:17:16] <wikibugs>	 (03CR) 10Andrew Bogott: "Chase, did you mean to -1 https://gerrit.wikimedia.org/r/#/c/375941/ which actually is redundant?" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:20:14] <icinga-wm>	 RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[14:20:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, after the merge the metric will start to be generated. The dashboards can be adjusted in a week or so when the new metrics have some" [puppet] - 10https://gerrit.wikimedia.org/r/376665 (owner: 10Ema)
[14:22:58] <wikibugs>	 (03CR) 10Rush: "Yep!" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:23:17] <wikibugs>	 (03CR) 10Rush: [C: 04-1] "Already side merged this change :D" [puppet] - 10https://gerrit.wikimedia.org/r/375941 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:25:35] <wikibugs>	 (03CR) 10Hashar: [C: 031] "Oops I forgot to report back on this change.  I did the upgrade on 8/31:" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[14:25:44] <icinga-wm>	 RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[14:26:34] <icinga-wm>	 PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[systemd-timesyncd]
[14:26:45] <wikibugs>	 (03PS2) 10Rush: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:27:17] <wikibugs>	 (03CR) 10Hashar: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/374999 (owner: 10Hashar)
[14:27:41] <wikibugs>	 (03PS1) 10Mobrovac: ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281)
[14:33:34] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600)
[14:34:21] <wikibugs>	 (03PS1) 10Hashar: nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882)
[14:34:42] <wikibugs>	 (03Abandoned) 10Andrew Bogott: nova: make default 'nova' availability-zone explicit [puppet] - 10https://gerrit.wikimedia.org/r/375941 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:34:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott)
[14:35:06] <wikibugs>	 (03CR) 10Hashar: "This week I have migrated a lot of php 5.5 jobs from trusty to jessie :]  So we need less trusty instances." [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar)
[14:39:16] <wikibugs>	 (03PS2) 10Andrew Bogott: nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar)
[14:40:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar)
[14:42:50] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[14:42:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac)
[14:50:22] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3591719 (10mobrovac)
[14:52:45] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600)
[14:52:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: dnsrecursor: Enable PowerdnsRecursor diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/376732 (https://phabricator.wikimedia.org/T169600)
[14:55:02] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3591738 (10mobrovac) The repo has been set up and cloned on `tin` and the `ops/puppet` profile created and merged. Left to do is to...
[14:55:05] <icinga-wm>	 RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[14:56:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] [WMF] jessie: tweak build dependencies [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - 10https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar)
[14:56:54] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] [WMF] jessie: tweak build dependencies [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - 10https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar)
[14:57:24] <icinga-wm>	 PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cpjobqueue/deploy]
[14:58:08] <volans>	 !log testing wmf-auto-reimage also on mc1002 T166300 T164341
[14:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:23] <stashbot>	 T166300: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300
[14:58:24] <stashbot>	 T164341: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341
[15:01:24] <icinga-wm>	 RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[15:02:25] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4021.*
[15:02:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Force the ssh key to be used by scap [software/librenms] - 10https://gerrit.wikimedia.org/r/376734 (https://phabricator.wikimedia.org/T172333)
[15:04:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600) (owner: 10Alexandros Kosiaris)
[15:04:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] dnsrecursor: Enable PowerdnsRecursor diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/376732 (https://phabricator.wikimedia.org/T169600) (owner: 10Alexandros Kosiaris)
[15:04:50] <icinga-wm>	 PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[15:05:09] <akosiaris>	 ah
[15:05:12] <volans>	 _joe_: still testing?
[15:05:13] <_joe_>	 uhm
[15:05:14] <icinga-wm>	 PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[15:05:17] <akosiaris>	 _joe_: is that you?
[15:05:23] <_joe_>	 volans: actually yes, the resync is ongoing
[15:05:28] <akosiaris>	 ah ok
[15:05:30] <_joe_>	 but this wasn't expected tbh
[15:05:44] <icinga-wm>	 PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:05:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Force the ssh key to be used by scap [software/librenms] - 10https://gerrit.wikimedia.org/r/376734 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris)
[15:06:55] <logmsgbot>	 !log akosiaris@tin Started deploy [librenms/librenms@5554213]: Testing new scap keyholder_key config
[15:06:59] <logmsgbot>	 !log akosiaris@tin Finished deploy [librenms/librenms@5554213]: Testing new scap keyholder_key config (duration: 00m 04s)
[15:07:05] <_joe_>	 it already recovered btw
[15:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:14] <_joe_>	 uhm, nope
[15:07:15] <icinga-wm>	 RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[15:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:27] <_joe_>	 ok, time to stop the raid resync again
[15:07:44] <icinga-wm>	 RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[15:08:00] <icinga-wm>	 RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[15:08:39] <akosiaris>	 btw, the fact that a RAID resync would be killing the etcd replication is a "who would have thought" moment
[15:09:35] <volans>	 heheh
[15:14:05] <wikibugs>	 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591796 (10RobH) So Chris flipped around SDA and SDB but the installer still gives an error for SDA not being present.  This is likely...
[15:14:14] <wikibugs>	 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591799 (10RobH)
[15:15:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333)
[15:16:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Not urgent, but it would be nice to revert it at some point. Let me know what you think." [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris)
[15:21:19] <wikibugs>	 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591837 (10Johan) A couple of discussions: [[ https://en.w...
[15:21:21] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591838 (10Joe) I was too optimistic, it appears, in declaring victory. The new resync at reduced speeds still triggered consensus issues. It seems the version of etcd we'...
[15:21:27] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591839 (10Joe) 05Resolved>03Open
[15:35:05] <wikibugs>	 (03PS3) 10Andrew Bogott: labmon: prometheus classes to monitor the keystone api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/375452
[15:35:12] <andrewbogott>	 godog: lmk when you have a moment?  I have more prometheus questions (or, really, the same prometheus questions)
[15:36:34] <wikibugs>	 (03PS1) 10Hashar: zuul: allow email connection [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414)
[15:51:10] <wikibugs>	 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3591910 (10akosiaris)
[15:51:15] <wikibugs>	 10Operations, 10Diamond, 10Traffic, 10monitoring, and 2 others: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3591908 (10akosiaris) 05Open>03Resolved Patches created and merged. Got a basic dashboard working at https://grafana.wikimedia.org/dashbo...
[15:53:34] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4027.*
[15:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:19] <wikibugs>	 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591915 (10RobH) Chris also reseated it all, still error.
[16:00:44] <wikibugs>	 10Operations, 10monitoring: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3527972 (10akosiaris) >>! In T173427#3553083, @Volans wrote: > An alternative option could be to make this check passive, with a freshness threshold of like 35m, with the data pushed directly by the run-pup...
[16:03:02] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.17/includes/revisiondelete/RevDelLogList.php: typofix (duration: 00m 46s)
[16:03:14] <wikibugs>	 (03PS1) 10Chad: Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743
[16:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:38] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/AbuseFilter/includes/Views/AbuseFilterViewExamine.php: T175338, followup (duration: 00m 46s)
[16:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:50] <stashbot>	 T175338: Fatal: Wikimedia\Rdbms\DBQueryError on Special:AbuseFilter/examine: "Error: 1054 Unknown column 'rev_id' in 'field list' (10.64.16.191)" - https://phabricator.wikimedia.org/T175338
[16:06:04] <wikibugs>	 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591939 (10BBlack) Thanks! So far, I haven't heard of any...
[16:06:53] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "sounds reasonable to revert it since that was just a workaround for the scap issue which sounds resolved" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris)
[16:13:03] <wikibugs>	 (03CR) 10Dzahn: "oh, thanks for pointing that out. amending" [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn)
[16:15:52] <wikibugs>	 (03PS2) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204)
[16:16:41] <wikibugs>	 (03PS3) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204)
[16:18:29] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3591956 (10Cmjohnson)
[16:19:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10Cmjohnson) bios is setup, raid is configured to raid 10. switch ports need setup still  1019  ->  b4 ge-4/0/33 1020  -> b7 ge-7/0/13
[16:20:44] <wikibugs>	 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3591959 (10Cmjohnson) @MoritzMuehlenhoff is it okay to resolve this task?
[16:21:04] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "this doesn't give actual access yet - adding to the right group has to be separate anyways for technical reasons and needs to wait until M" [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn)
[16:28:44] <wikibugs>	 (03PS1) 10Dzahn: admins: add rho to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204)
[16:30:11] <godog>	 andrewbogott: heh, I have to run now, we can do next week tho!
[16:30:16] <godog>	 monday that is
[16:30:27] <andrewbogott>	 ok
[16:30:35] <wikibugs>	 (03CR) 10Dzahn: "not yet, but can be done on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn)
[16:34:22] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3591976 (10Dzahn) @Rho The 2 changes above are prepared now but not merged yet. They will create your user account (step 1) and then add it to th...
[16:37:28] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "yea, actually i think that too, we can just set the threshold really high. i just wanted to offer all the options since there was some dis" [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[16:37:43] <wikibugs>	 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591981 (10Johan) To be honest, most of the community is b...
[16:38:20] <wikibugs>	 (03PS4) 10Dzahn: contint: upgrade git on zuul mergers [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[16:39:34] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "don't worry, it's not NOW, the upgrade has already happened in the past, see comment from Hashar" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[16:41:57] <wikibugs>	 (03CR) 10Dzahn: "Info: Applying configuration version '1504888877'" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[16:44:05] <wikibugs>	 (03CR) 10Dzahn: "git is already the newest version." [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[16:52:38] <wikibugs>	 (03PS1) 10RobH: dsaez uid update [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220)
[16:53:46] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "renaming diego to dsaez seems right and should fix consistency warning and SWAP access" [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220) (owner: 10RobH)
[16:53:52] <wikibugs>	 (03PS1) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750
[16:54:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (owner: 10ArielGlenn)
[16:57:03] <wikibugs>	 (03CR) 10RobH: [C: 032] dsaez uid update [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220) (owner: 10RobH)
[16:57:45] <wikibugs>	 (03PS1) 10BBlack: [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751
[17:02:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris)
[17:02:53] <wikibugs>	 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3592052 (10Papaul) Hi Papaul,     Here is the dispatch information for service to address system unresponsiveness of the PowerEdge R430 with service tag FXLPND2.     Next business day service dispatch  Service date d...
[17:04:17] <wikibugs>	 (03PS2) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750
[17:04:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn)
[17:06:12] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:07:06] <mutante>	 ^ that's gonna be related to the user renaming
[17:07:08] <mutante>	 and will be temp
[17:07:26] <mutante>	 robh is on it
[17:07:36] <robh>	 yeah, he is logged into that
[17:07:42] <robh>	 and i need to tell him to kill his screen sessions
[17:07:54] <mutante>	 heh, screen sessions
[17:08:33] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:09:52] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:10:13] <icinga-wm>	 PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:11:53] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[17:12:42] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:14:12] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:16:22] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:19:29] <wikibugs>	 10Operations, 10Puppet, 10Readers-Web-Backlog (Tracking): Remove references to non-existent mfLazyLoadReferences cookies - https://phabricator.wikimedia.org/T175381#3592074 (10Jdlrobson)
[17:19:48] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3592094 (10RobH)
[17:19:50] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Change  prod uid  from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3592092 (10RobH) 05Open>03Resolved This change caused a wholly expected error, namely the new user had their uid used by the old us...
[17:19:59] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 (10RobH) 05Open>03Resolved a:03RobH Fixed!
[17:20:12] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[17:21:54] <wikibugs>	 (03PS4) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204)
[17:24:34] <icinga-wm>	 RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[17:28:53] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "let me know if today is good" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox)
[17:31:03] <icinga-wm>	 PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:32:03] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez]
[17:35:12] <icinga-wm>	 RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[17:38:42] <icinga-wm>	 PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[17:38:52] <icinga-wm>	 PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106)
[17:40:12] <icinga-wm>	 RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[17:40:32] <icinga-wm>	 RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[17:45:23] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[17:48:38] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris)
[17:51:08] <wikibugs>	 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3592267 (10madhuvishy) 05Open>03Resolved @fgiunchedi Thank you! That seems to have fixed it. Resolving this task. Thanks everyone :)
[17:53:29] <wikibugs>	 (03PS1) 10Chad: Drop PrivateSettings symlink, just include directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376762
[17:54:06] <wikibugs>	 (03PS10) 10Chad: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy)
[17:57:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy)
[18:01:05] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3592304 (10EBernhardson) cirrusSearchCheckerJob - basically idempotent. It verifies data in elasticsearch matches mysql, creates new...
[18:10:43] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3592325 (10RHo) Awesome, thanks @Dzahn !
[18:12:01] <wikibugs>	 (03PS1) 10Chad: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767
[18:12:03] <wikibugs>	 (03CR) 10Chad: [C: 032] Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad)
[18:12:56] <wikibugs>	 (03PS1) 10Dzahn: admins: add Kai Nissen (knissen) to LDAP (nda) users [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046)
[18:13:52] <wikibugs>	 (03Merged) 10jenkins-bot: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad)
[18:16:11] <wikibugs>	 (03CR) 10jenkins-bot: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad)
[18:17:08] <wikibugs>	 (03CR) 10RobH: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046) (owner: 10Dzahn)
[18:17:31] <logmsgbot>	 !log demon@tin Synchronized wmf-config/FeaturedFeedsWMF.php: no-op (duration: 00m 46s)
[18:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:50] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad)
[18:18:56] <wikibugs>	 (03CR) 10Dzahn: [C: 032] admins: add Kai Nissen (knissen) to LDAP (nda) users [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046) (owner: 10Dzahn)
[18:30:02] <hashar>	 mutante: Guten Tag.  I got some notification from you but the irc client froze eventually
[18:30:52] <mutante>	 hashar: i was just merging the "upgrade git on zuul mergers change" and saying (to others watching it) to not worry - because it looked like an upgrade now.. but actually it already happened in the past
[18:31:10] <mutante>	 so i just said "dont worry.. etc.. see comment by hashar above"
[18:31:30] <hashar>	 mutante: yeah sorry I forgot to poke the Gerrit task when I did the upgrade manually
[18:31:33] <mutante>	 and then i confirmed the change on contint1001, it added the backports config 
[18:31:44] <hashar>	 and apparently nothing happened since I did the change. So I guess it is fine
[18:31:44] <mutante>	 but also the git package was already upgraded.. no-op
[18:31:51] <hashar>	 \o/
[18:32:07] <mutante>	 yep :)
[19:04:55] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3592485 (10aaron) >>! In T171071#3464513, @Marostegui wrote: > Hey,  >  > I would be nice to do a test with MariaDB 10.0 and 1...
[19:07:43] <wikibugs>	 (03PS2) 10Herron: WIP: icinga: add check_sysctl.sh script [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060)
[19:25:24] <wikibugs>	 (03PS1) 10Dzahn: site: remove unused virtual host 'zosma' [puppet] - 10https://gerrit.wikimedia.org/r/376779 (https://phabricator.wikimedia.org/T138650)
[19:27:45] <wikibugs>	 (03PS1) 10Dzahn: remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650)
[19:37:47] <mutante>	 !log removing ganeti instance 'zosma' on ganeti2001 (T138650)
[19:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:00] <stashbot>	 T138650: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650
[19:38:45] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "removed on ganeti2001 - nobody except ops ever logged in here (there were no access groups)" [puppet] - 10https://gerrit.wikimedia.org/r/376779 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn)
[19:39:23] <icinga-wm>	 PROBLEM - Host zosma is DOWN: PING CRITICAL - Packet loss = 100%
[19:40:33] <icinga-wm>	 ACKNOWLEDGEMENT - Host zosma is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T138650
[19:43:06] <mutante>	 !log zosma.codfw.wmnet - delete salt key, puppet node clean, puppet node deactivate, remove from Icinga,... (T138650)
[19:43:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:18] <stashbot>	 T138650: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650
[19:49:33] <wikibugs>	 (03PS2) 10Dzahn: remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650)
[19:50:03] <wikibugs>	 (03CR) 10Dzahn: [C: 032] remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn)
[19:52:04] <wikibugs>	 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3592768 (10Dzahn) @EddieGP Thanks, yea. I removed it. Should be all done now, also DNS.
[19:52:23] <wikibugs>	 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3592771 (10Dzahn)
[19:52:27] <wikibugs>	 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3592770 (10Dzahn) 05Open>03Resolved
[19:52:41] <wikibugs>	 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10Dzahn)
[19:52:43] <wikibugs>	 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2406322 (10Dzahn) 05Resolved>03declined
[20:03:20] <wikibugs>	 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3592820 (10Dzahn) 05stalled>03declined
[20:06:37] <wikibugs>	 (03CR) 10Chad: [C: 032] Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad)
[20:08:19] <wikibugs>	 (03Merged) 10jenkins-bot: Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad)
[20:08:29] <wikibugs>	 (03CR) 10jenkins-bot: Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad)
[20:10:02] <logmsgbot>	 !log demon@tin Synchronized scap/scap.cfg: drop git_repo for now, T175041 (duration: 00m 46s)
[20:10:11] <mutante>	 !log heze (bacula storage) - installing BIOS upgrade (T162850)
[20:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:17] <stashbot>	 T175041: scap sync failed on i18n - https://phabricator.wikimedia.org/T175041
[20:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:29] <stashbot>	 T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850
[20:14:19] <mutante>	 !log heze - rebooting for firmware upgrade
[20:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:20] <wikibugs>	 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3592861 (10Dzahn)
[20:24:39] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/Flow/includes/: bugs and stuff (duration: 00m 54s)
[20:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:30] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2047458
[20:30:17] <Reedy>	 no_justification: Adding or removing bugs?
[20:30:19] <wikibugs>	 10Operations, 10monitoring: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3592881 (10herron) >! In T173427#3591926, @akosiaris wrote: > > It's about avoiding transient network failure positives and making sure the failure is not a transient one.  This, but about puppet runs, is w...
[20:30:44] <no_justification>	 Reedy: Well, I wasn't undeploying Flow so I guess adding?
[20:30:47] <no_justification>	 ;-)
[20:33:13] <mutante>	 needs to reboot all 3 logstash servers.. but .. eh.. not now 
[20:35:11] <no_justification>	 mutante: Huh? 
[20:36:10] <mutante>	 BIOS upgrade
[20:36:23] <wikibugs>	 (03PS1) 10MaxSem: Leave a comment that ACW must be loaded before VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376791
[20:36:29] <mutante>	 because they are Dell R320 and there was that weird CPU throttling on those sometimes
[20:36:48] <mutante>	 i suppose i could depool one of them at a time, reboot, repool, wait .. next one etc
[20:37:05] <mutante>	 but also .., stuff like  https://wikitech.wikimedia.org/wiki/Service_restarts#Logstash
[20:37:09] <mutante>	 ?
[20:37:34] <mutante>	 this is just 1001-1003 though, not 1004-1006
[20:38:48] <mutante>	 yea, wasn't really related to the gerrit thing, but somebody who knows logstash well would be nice for both, heh
[20:39:18] <volans>	 mutante: you want guillaume ;)
[20:39:30] <volans>	 1004-6 are the new ones, not really sure of the migration state
[20:39:36] <mutante>	 ah!
[20:39:50] <mutante>	 thanks volans, ok
[20:40:43] <volans>	 mutante: actually I might be confusing the numbers? T175045
[20:40:43] <stashbot>	 T175045: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045
[20:40:50] <volans>	 now I'm lost :D
[20:41:50] <mutante>	 oops, eh yea :)
[20:41:56] <mutante>	 i'll ask
[20:46:02] <no_justification>	 Wait, there's a new BIOS version to fix the CPU issue on R320s?
[20:46:56] <wikibugs>	 (03PS1) 10MaxSem: Add logging for email blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376792 (https://phabricator.wikimedia.org/T175419)
[20:48:19] <mutante>	 no_justification: yea, well.. we are installing 2.4.2 over 1.2.4 and haven't seen one so far.. hope
[20:48:45] <mutante>	 maybe :)
[20:49:36] <wikibugs>	 (03CR) 10Kaldari: [C: 031] Leave a comment that ACW must be loaded before VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376791 (owner: 10MaxSem)
[20:50:30] <wikibugs>	 (03PS11) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324)
[21:04:07] <no_justification>	 mutante: Tbf, we never managed to isolate which R320s did it anyway, so silence != fixed (sadly)
[21:07:21] <mutante>	 no_justification: we did to a certain extent  https://phabricator.wikimedia.org/T162850#3179492  
[21:07:42] <mutante>	 well the ones that got tickets are linked
[21:08:24] <no_justification>	 R320s with Jessie and a 4.x kernel. But I assume there's been others in that same set that haven't been affected?
[21:08:25] * no_justification shrugs
[21:08:42] <mutante>	 there are not that many R320s
[21:09:19] <mutante>	 as opposed to other models
[21:09:34] <mutante>	 and only R320s have been affected. within their group the percentage is pretty high
[21:09:46] <mutante>	 (before upgrades)
[21:10:10] * no_justification nods
[21:23:34] <paladox>	 hi, does anyone know where logstash stores things by default if you have not configured a dashboard for it
[21:23:46] <paladox>	 cc mutante ^^
[21:24:35] <Zppix>	 paladox:  what are you looking for ? maybe i can help you find it (i assume logstash-beta aka the cloud services instance of logstash?)
[21:25:14] <paladox>	 Zppix it's logstash, /me and mutante are testing logstash for gerrit. But before anything is done in gerrit's config we are testing through the command line to see if it works
[21:25:24] <paladox>	 logstash prod
[21:25:44] <Zppix>	 oh nevermind i cannot (prod logstash is outside of my access level)
[21:26:17] <mutante>	 do you have logstash not-prod?
[21:26:36] <paladox>	 nope
[21:27:56] <Zppix>	 mutante:  i can only access logstash-beta lol
[21:28:21] <mutante>	 well, that's not-prod :)
[21:29:47] <mutante>	 so yes
[21:31:12] <mutante>	 Zppix: it's about making a nice dashboard i guess
[21:32:10] <Zppix>	 i don't know much about logstash other than to look at it to see how bad i screwed something up xD
[21:32:38] <mutante>	 ;)
[21:43:01] <icinga-wm>	 PROBLEM - Host mc1002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:43:10] <volans>	 this is me ^^^
[21:43:21] <icinga-wm>	 RECOVERY - Host mc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[21:49:11] <chasemp>	 volans: you should be off enjoying the weekend :)
[21:49:51] <volans>	 chasemp: you got me!
[21:54:42] <mutante>	 !log gerrit2001 - restarting gerrit to test logstash config change (while not touching "live" gerrit)
[21:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:22] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3593074 (10Pchelolo)
[21:57:59] <wikibugs>	 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10Pchelolo) Thank you @EBernhardson, updated the task with your info. Now we've got a complete list of jobs executed in pro...
[22:19:58] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[22:20:58] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:22:13] <wikibugs>	 (Draft1) Paladox: logstash: Enable ferm port 5229 and enable tcp input  [puppet] - https://gerrit.wikimedia.org/r/376836
[22:22:16] <wikibugs>	 (PS2) Paladox: logstash: Enable ferm port 5229 and enable tcp input  [puppet] - https://gerrit.wikimedia.org/r/376836
[22:29:06] <wikibugs>	 (PS1) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494)
[22:29:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:29:52] <wikibugs>	 (PS2) Rush: openstack: re-enable notify and subscribe for nova [puppet] - https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494)
[22:29:54] <wikibugs>	 (PS2) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494)
[22:30:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:30:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: re-enable notify and subscribe for nova [puppet] - https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:33:34] <wikibugs>	 (Abandoned) Paladox: logstash: Enable ferm port 5229 and enable tcp input  [puppet] - https://gerrit.wikimedia.org/r/376836 (owner: Paladox)
[22:34:35] <wikibugs>	 (PS3) Rush: openstack: re-enable notify and subscribe for nova [puppet] - https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494)
[22:34:37] <wikibugs>	 (PS3) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494)
[22:35:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:35:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: re-enable notify and subscribe for nova [puppet] - https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:37:20] <wikibugs>	 (PS4) Rush: openstack: re-enable notify and subscribe for nova [puppet] - https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494)
[22:37:22] <wikibugs>	 (PS4) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494)
[22:37:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[23:09:20] <mutante>	 !log phab1001, phab2001 - terminating 3 unused screen processes that were from iridium migration / data sync
[23:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:39] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:27:39] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:32:48] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:38:12] <mutante>	 !log netmon1002 - terminated unused screen session
[23:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:55] <wikibugs>	 (PS1) EBernhardson: Fix human search relvance survey sampling rates [mediawiki-config] - https://gerrit.wikimedia.org/r/376852
[23:42:29] <wikibugs>	 (CR) EBernhardson: [C: 2] Fix human search relvance survey sampling rates [mediawiki-config] - https://gerrit.wikimedia.org/r/376852 (owner: EBernhardson)
[23:44:06] <wikibugs>	 (Merged) jenkins-bot: Fix human search relvance survey sampling rates [mediawiki-config] - https://gerrit.wikimedia.org/r/376852 (owner: EBernhardson)
[23:46:09] <wikibugs>	 (CR) jenkins-bot: Fix human search relvance survey sampling rates [mediawiki-config] - https://gerrit.wikimedia.org/r/376852 (owner: EBernhardson)
[23:46:22] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-rel-survey.php: T171740: Fix inverted sampling rates for human relevance survey (duration: 00m 47s)
[23:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:36] <stashbot>	 T171740: [Epic] Search Relevance: graded by humans - https://phabricator.wikimedia.org/T171740