[00:01:42] 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (10BBlack) [00:42:20] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/AbuseFilter/includes/Views/AbuseFilterViewExamine.php: fix comment stuff (duration: 00m 46s) [00:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:44] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590296 (10demon) >>! In T175288#3589585, @RobH wrote: > I'm going to install with jessie, like tin and naos both presently have... [00:50:46] (03CR) 10Chad: [C: 031] "This can land whenever." [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [00:57:18] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590306 (10bd808) Moving the deploy server to stretch should probably wait for a stretch HHVM build shouldn't it? [02:03:27] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [02:06:57] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycache: try to repair it on query. Default database: dewiki. 
[Query snipped] [02:12:48] (03PS5) 10GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) [03:34:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.59 seconds [03:40:03] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3590434 (10demon) Probably a decent idea. Ignore my idea. [03:46:27] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycache: try to repair it on query. Default database: dewiki. [Query snipped] [03:56:21] https://phabricator.wikimedia.org/T175304 <- this is a serious issue [04:09:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 131.62 seconds [04:43:38] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [04:45:37] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.007 second response time [04:53:37] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89790.55 seconds [05:06:11] yannf: Have you found the culprit? 
[05:16:27] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational [05:37:38] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [05:38:38] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [05:57:47] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:57:57] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:57:57] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:57:57] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:57:58] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:57:58] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:57:58] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:07] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:08] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:08] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:17] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:17] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:27] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:28] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:37] PROBLEM - MariaDB 
Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:37] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:37] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:37] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:58:37] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:58:47] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:59:27] RECOVERY - jmxtrans on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [06:01:17] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:17] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:17] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:17] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [06:01:37] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:37] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:37] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:38] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:38] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:38] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:47] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:47] RECOVERY - MariaDB Slave SQL: m3 on 
dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:57] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:01:58] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:01:58] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:01:58] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:02:07] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:02:07] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:02:07] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:03:23] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590510 (10elukey) ``` elukey@kafka-jumbo1001:/usr/share/jmxtrans$ source /etc/default/jmxtrans elukey@kafka-jumbo1001:/usr/share/jmx... [06:03:35] gehel: ---^ (just as FYI) [06:32:36] (03CR) 10Muehlenhoff: [C: 04-1] "She's already in ldap_only_users, so that needs to be removed when getting shell access in addition." [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [06:34:40] 10Operations, 10Ops-Access-Requests: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3590515 (10MoritzMuehlenhoff) a:03RobH Assigning to Rob, who created both the LDAP and shell user access. This also triggers a warning in the daily account... [06:35:50] (03PS1) 10Elukey: jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) [06:39:45] (03CR) 10Reedy: "We could use symlinks instead... 
But if this is going to be shortlived...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [06:41:33] (03CR) 10DCausse: [V: 032 C: 032] adding Priority: optional to metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376568 (owner: 10Gehel) [06:55:06] (03PS1) 10Ema: prometheus: add aggregation rules for IPVS [puppet] - 10https://gerrit.wikimedia.org/r/376665 [06:56:42] (03CR) 10Muehlenhoff: [C: 032] Also print amount of hosts not requiring a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376203 (owner: 10Muehlenhoff) [06:58:04] (03PS2) 10Elukey: jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) [06:58:30] (03PS2) 10Muehlenhoff: Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 [07:01:55] (03CR) 10Muehlenhoff: [C: 032] Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 (owner: 10Muehlenhoff) [07:02:48] (03PS2) 10Muehlenhoff: Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 [07:03:59] (03CR) 10Muehlenhoff: [C: 032] Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 (owner: 10Muehlenhoff) [07:04:58] (03PS2) 10Muehlenhoff: Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376277 [07:06:24] (03CR) 10Muehlenhoff: [C: 032] Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376277 (owner: 10Muehlenhoff) [07:10:49] (03PS2) 10Muehlenhoff: Remove lvs salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376213 [07:11:27] (03CR) 10Muehlenhoff: [C: 032] Remove lvs salt grains previously used 
by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376213 (owner: 10Muehlenhoff) [07:11:33] (03PS2) 10Ema: maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345591 (owner: 10BBlack) [07:13:13] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3590555 (10Gilles) [07:13:26] (03PS2) 10Muehlenhoff: Remove cache salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376212 [07:13:44] (03CR) 10Volans: [C: 032] "Trivial fix, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/374132 (owner: 10Volans) [07:15:01] (03CR) 10Muehlenhoff: [C: 032] Remove cache salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376212 (owner: 10Muehlenhoff) [07:15:36] (03Merged) 10jenkins-bot: CLI: fix --version option [software/cumin] - 10https://gerrit.wikimedia.org/r/374132 (owner: 10Volans) [07:16:36] (03CR) 10Volans: [C: 032] "Trivial fix, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/374133 (https://phabricator.wikimedia.org/T174008) (owner: 10Volans) [07:18:17] (03Merged) 10jenkins-bot: Fix data_files installation directory [software/cumin] - 10https://gerrit.wikimedia.org/r/374133 (https://phabricator.wikimedia.org/T174008) (owner: 10Volans) [07:18:32] (03PS2) 10Volans: Transports: better handling of empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) [07:18:46] (03PS2) 10Muehlenhoff: Remove k8s salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376265 [07:19:50] (03CR) 10Muehlenhoff: [C: 032] Remove k8s salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376265 (owner: 10Muehlenhoff) [07:21:10] !log mobrovac@tin Started deploy [restbase/deploy@e6aeeeb] (dev-cluster): (no justification provided) 
[07:21:22] !log mobrovac@tin Finished deploy [restbase/deploy@e6aeeeb] (dev-cluster): (no justification provided) (duration: 00m 11s) [07:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:04] (03PS2) 10Muehlenhoff: Remove ganeti salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376247 [07:22:45] (03CR) 10Muehlenhoff: [C: 032] Remove ganeti salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376247 (owner: 10Muehlenhoff) [07:23:44] (03CR) 10Volans: [C: 032] Transports: better handling of empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) (owner: 10Volans) [07:24:03] (03PS2) 10Muehlenhoff: Memove releng-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376244 [07:24:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590584 (10elukey) Applied the following to all the nodes to remove the placeholder logical volumes: ``` root@kafka-jumbo1001:/home/... 
[07:25:10] !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): (no justification provided) [07:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:30] (03Merged) 10jenkins-bot: Transports: better handling of empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911) (owner: 10Volans) [07:25:47] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [07:26:03] (03CR) 10Muehlenhoff: [C: 032] Memove releng-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376244 (owner: 10Muehlenhoff) [07:32:11] !log mobrovac@tin Finished deploy [restbase/deploy@0d39acf] (dev-cluster): (no justification provided) (duration: 07m 02s) [07:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:53] 10Operations, 10Analytics, 10netops, 10User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3590602 (10elukey) The next step is to design and add the `analytics-in6` filter to cr1/cr2 eqiad, but I would wait for kafka1012-1022 to be decommissioned before that. Those h... 
[07:46:37] RECOVERY - Restbase root url on restbase-dev1004 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.034 second response time [07:48:59] (03Abandoned) 10Volans: wmf-auto-reimage: support mutiple conftool roles [puppet] - 10https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) (owner: 10Volans) [07:51:50] https://phabricator.wikimedia.org/T174269 [07:53:20] !log installing ruby2.3 security updates [07:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:38] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [08:07:58] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [08:11:10] !log restbase enabled back puppet in the dev cluster - T169940 [08:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:23] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [08:11:55] !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster - T169940 [08:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:06] !log installing libgd2 security updates [08:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:25] (03CR) 10Filippo Giunchedi: [C: 04-1] "The rule names are correct, syntax is missing ()" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/376665 (owner: 10Ema) [08:20:15] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590648 (10elukey) I could be wrong but from cr1/cr2 eqiad the hosts seem to be in the Analytics VLAN, and they shouldn't be: ``` el... 
[08:21:30] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590649 (10jcrespo) > Already answered above. CRIT on Icinga doesn't mean "can cause an outage". But it's also easy to change to WARN. Thank you; I was in some cases ignoring W... [08:23:43] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590651 (10elukey) @Ottomata: let's also remember to whitelist the jumbo IPs in the Analytics VLAN firewall rules, otherwise hosts li... [08:26:19] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590653 (10jcrespo) cc @Faidon ^ [08:42:18] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:42:20] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3590669 (10fgiunchedi) >>! In T170120#3587521, @mobrovac wrote: > On the metrics side, we standardised on the StatsD format, but +1 on using Prometheus... 
[08:43:45] (03PS2) 10Muehlenhoff: Remove parsoid salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376249 [08:44:46] (03CR) 10Muehlenhoff: [C: 032] Remove parsoid salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376249 (owner: 10Muehlenhoff) [08:51:36] (03PS2) 10Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376245 [08:52:30] (03PS3) 10Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376245 [08:53:14] (03CR) 10Muehlenhoff: [C: 032] Remove WMCS-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376245 (owner: 10Muehlenhoff) [08:54:17] (03PS2) 10Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 [08:55:03] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590701 (10jcrespo) > Were security considerations taken into account? Security considerations with pinging on a public channel which servers security patches (in the form of m... [08:55:09] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (10fgiunchedi) Those messages are due to acpi power meter, which we blacklist as of https://gerrit.wikimedia.org/r/#/c/356422/. A reboot should make the message go away. 
[08:56:32] (03PS2) 10Muehlenhoff: Remove NFS salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376274 [08:57:09] (03CR) 10Muehlenhoff: [C: 032] Remove NFS salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376274 (owner: 10Muehlenhoff) [09:00:59] (03PS1) 10Muehlenhoff: Remove labtest salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376672 [09:01:38] !log installing remaining gnutls updates from jessie point release [09:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:00] !log mobrovac@tin Started deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster, take #2 - T169940 [09:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:14] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [09:08:01] !log mobrovac@tin Finished deploy [restbase/deploy@0d39acf] (dev-cluster): Use double writing for mobileapps in the dev cluster, take #2 - T169940 (duration: 01m 01s) [09:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:17] RECOVERY - Restbase root url on restbase-dev1005 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.014 second response time [09:09:07] RECOVERY - Restbase root url on restbase-dev1006 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.017 second response time [09:09:15] !log installing libonig security updates [09:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:51] (03CR) 10Gehel: "Wow! That's a lot of GC tuning! 
The current upstream has gone back to a much saner default (https://github.com/jmxtrans/jmxtrans/blob/mast" [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [09:15:57] (03CR) 10Gehel: [C: 031] "Another note, the default heap / perm size (512/384) are probably far too large for our use case (but I have not looked at any data, just " [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [09:16:37] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3590739 (10fgiunchedi) FTR: username is root and password is the management password. I checked the webui for ps1-a2-eqiad and the current/voltage readings are... [09:16:56] (03PS1) 10Gilles: Upgrade to 1.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/376674 (https://phabricator.wikimedia.org/T173580) [09:20:25] (03PS1) 10Gilles: Thumbor: enable new MAX_ANIMATED_GIF_AREA option [puppet] - 10https://gerrit.wikimedia.org/r/376676 (https://phabricator.wikimedia.org/T173580) [09:24:34] (03CR) 10Elukey: [C: 032] jmxtrans.sh: reduce MaxTenuringThreshold to 15 [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/376663 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [09:25:08] !log restarting apache on tegmen/einsteinium to pick up security update [09:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:20] 10Operations, 10DC-Ops: Review and fix PDU settings for syslog/ntp servers - https://phabricator.wikimedia.org/T175341#3590787 (10fgiunchedi) [09:43:44] 10Operations, 10DC-Ops: Review and fix PDU settings for syslog/ntp/email servers - https://phabricator.wikimedia.org/T175341#3590843 (10fgiunchedi) [09:44:49] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: 
Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3590846 (10Gilles) [09:46:41] 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3590850 (10Joe) Running the mdadm command on one host caused a re-election to happen. It seems likely we found the culprit, so now I'm going to run the command at the same time on first two hos... [09:48:00] <_joe_> !log running `/usr/share/mdadm/checkarray --cron --all --idle --quiet` on conf2001 and conf2003 trying to reproduce consensus issues in T162013 [09:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:13] T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013 [09:58:14] <_joe_> !log running the same command on conf2002 too [09:58:18] ack [09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] <_joe_> volans: this time I'm not seeing leader loss or anything, uhm [09:59:18] <_joe_> so let's see with all three active [09:59:28] which md drive it's checking now? 
[09:59:40] <_joe_> it already got to md2
[09:59:47] <_joe_> on conf2001 and 2003
[10:00:06] Esther, not fixed yet
[10:00:11] https://phabricator.wikimedia.org/T175304 <- this is a serious issue
[10:00:13] <_joe_> and we have 15 to 10% iowait there
[10:00:29] let's see
[10:01:17] RECOVERY - Check systemd state on restbase2003 is OK: OK - running: The system is fully operational
[10:06:17] (PS1) Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688
[10:07:44] !log testing wmf-auto-reimage on mc1001 T166300 T164341
[10:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:57] T166300: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300
[10:07:58] T164341: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341
[10:08:48] (CR) Muehlenhoff: [C: 2] Remove labtest salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376672 (owner: Muehlenhoff)
[10:09:29] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3590908 (elukey) @Ottomata: I merged https://gerrit.wikimedia.org/r/#/c/376663 but I then realized that master/debian branches are...
[10:09:38] <_joe_> volans: here we are
[10:09:42] <_joe_> consensus lost
[10:09:53] <_joe_> !log etcd cluster lost consensus T162013
[10:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:06] T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013
[10:10:14] <_joe_> this will page people, sorry for that
[10:10:54] _joe_: disabled active checks, so no page ;)
[10:11:14] <_joe_> volans: oh ok, I would've loved to see after how much time we would notice this
[10:12:18] 3 checks
[10:12:29] <_joe_> !log stopped all md devices syncs on conf200* machines
[10:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:49] <_joe_> and as soon as I did that, consensus is back
[10:13:35] <_joe_> ok, so we can just stagger the resyncs across different dates
[10:13:44] <_joe_> or reduce the speed of the resync via sysctl
[10:13:53] <_joe_> actually I think I'll do that
[10:14:12] md checks should be ideally scattered inside a cluster
[10:14:14] like puppet runs
[10:14:33] <_joe_> volans: still, do you agree we might even try to just reduce the resync speed?
[10:14:38] sure
[10:15:04] <_joe_> I'll post these observations to the ticket. Again, great catch
[10:15:30] I was thinking to use the last 2 digit of the hostname (for names like nameXXXX) modulo 28 as the date of the month to do that :-P
[10:15:40] _joe_: re-enabling active checks
[10:15:42] <_joe_> (I restarted replication btw)
[10:15:46] <_joe_> yeah please do
[10:15:57] <_joe_> I'm going afk for a bit, but everything is ok no
[10:15:59] <_joe_> *now
[10:16:41] https://phabricator.wikimedia.org/T175304 <- this is a serious issue < Reedy / legoktm / no_justification willing to do a weekend deploy?
[10:18:01] this prevents most works on all Wikisource
[10:18:10] grgmbmbm
[10:18:31] yannf: p858snake: do you have a way to reproduce? I dont mind deploying it now
[10:19:36] and if you have the X-Wikimedia-Debug browser extension https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions you would even be able to test it :]
[10:20:03] also apparently the patch got merged on beta, so the issue should no more show up on https://en.wikisource.beta.wmflabs.org/wiki/Main_Page
[10:20:55] (PS1) Muehlenhoff: Remove memcached/redis salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376689
[10:21:24] hashar, you need a page in the Page: namespace
[10:21:44] i.e. so probably also an index
[10:23:42] hashar, the buttons should appear here at the bottom https://fr.wikisource.org/w/index.php?title=Page:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol12.djvu/1&action=edit&redlink=1
[10:23:58] there is none right now
[10:24:07] ok
[10:24:52] +2 ed
[10:25:02] and will deploy on the debug servers once the patch has merged
[10:25:14] then try the above link using the X-Wikimedia-Debug browser extension
[10:26:16] https://en.wikisource.beta.wmflabs.org/wiki/Page:Dictionary_of_National_Biography_volume_51.djvu/397 here
[10:26:33] (CR) Elukey: [C: 1] Remove memcached/redis salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376689 (owner: Muehlenhoff)
[10:27:13] not sure why, the index doesn't show up properly on Beta
[10:28:43] I can deploy
[10:29:30] hashar: you are already on it, awesome :)
[10:30:02] ok it got merged
[10:31:43] p858snake: yannf: Amir1: deployed on mwdebug1001
[10:33:48] I can't see it
[10:33:54] tried several times
[10:34:18] I can
[10:34:24] on mwdebug1001 the button show up
[10:35:02] !log hashar@tin Synchronized php-1.30.0-wmf.17/extensions/ProofreadPage: Restore page status buttons - T175304 (duration: 00m 50s)
[10:35:09] yannf: try again ? :]
[10:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:15] T175304: Page status buttons do not appear any more in Wikisource - https://phabricator.wikimedia.org/T175304
[10:35:20] Amir1: I used https://fr.wikisource.org/w/index.php?title=Page:Tolsto%C3%AF_-_%C5%92uvres_compl%C3%A8tes,_vol12.djvu/1&action=edit&redlink=1
[10:35:37] I can see it now
[10:36:55] I marked it as fixed up
[10:39:01] yeah, good now
[10:39:05] thanks a lot!
[10:47:31] (CR) Muehlenhoff: [C: 2] Remove memcached/redis salt grains previously used by debdeploy [puppet] - https://gerrit.wikimedia.org/r/376689 (owner: Muehlenhoff)
[10:48:58] Operations, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591019 (Joe) Result of the latest experiment: - Consensus was lost as soon as resync reached md2 on all servers - `iowait` rose on all servers above 10%, see https://grafana.wikimedia.org/d...
[10:59:57] Operations, Goal, Kubernetes, Services (watching), User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3591039 (Joe) > I am a bit more concerned about performance and reliability implications of adding indirections in the data path itself. TLS is suppo...
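(Editor's note on the 10:13-10:15 exchange about staggering md checks.) The idea floated at 10:15:30, taking the last two digits of the hostname modulo 28 as the day of the month, could be sketched roughly as below. This is only an illustration, not anything that was deployed: the function name `md_check_day` is hypothetical, and the `+ 1` is an assumption added here so the result lands on a valid day 1-28 rather than 0-27.

```python
import re


def md_check_day(hostname: str) -> int:
    """Map a hostname like 'conf2001' to a day of the month in 1..28.

    Hypothetical helper: uses the last two trailing digits of the short
    hostname, modulo 28, shifted by one so that day 0 never occurs.
    """
    short = hostname.split(".", 1)[0]      # drop any domain suffix
    match = re.search(r"(\d+)$", short)    # trailing digits, e.g. '2001'
    if match is None:
        raise ValueError(f"no trailing digits in hostname: {hostname!r}")
    last_two = int(match.group(1)[-2:])    # '2001' -> 1
    return last_two % 28 + 1


if __name__ == "__main__":
    for host in ("conf2001", "conf2002", "conf2003"):
        print(host, md_check_day(host))
```

With this mapping the three conf200* hosts land on consecutive days, so their checks would no longer overlap. Hosts whose last two digits differ by a multiple of 28 (for example 01 and 29) would still share a day.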
[11:01:44] <_joe_> win 27 [11:01:54] <_joe_> even with a slash in front [11:06:16] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3591084 (10phuedx) [11:07:11] (03PS1) 10Muehlenhoff: Remove remaining salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376702 [11:12:33] PROBLEM - Host mc1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:14:16] mc1001 is the old server batch, might be decom in progress [11:15:23] RECOVERY - Host mc1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:16:58] that's me [11:17:01] see SAL [11:17:05] testing the reimage script [11:17:17] * volans wonders how can be in icinga though [11:25:03] PROBLEM - Host mc1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:23] RECOVERY - Host mc1001 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [11:30:55] (03CR) 10Volans: "See comment inline" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff) [11:53:31] (03PS1) 10Mobrovac: Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) [11:56:24] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:34] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:34] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:35] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:45] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:56:54] PROBLEM - puppet last run on dumpsdata1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:54] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:04] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:04] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:05] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:14] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:24] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:24] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:25] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:25] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:34] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:35] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:35] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:02:18] (03CR) 10Muehlenhoff: Fix parsing of necessary restarts in query_restart (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff) [12:02:29] (03PS2) 10Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 [12:08:03] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3591270 (10mobrovac) [12:12:07] akosiaris: FYI puppetdb was restarted on nitrogen by the OOM-killer ^^^ [12:12:22] (killed by the OOM-killer, restarted by systemd ofc ;) ) [12:14:32] ah nice [12:16:28] seems that we still don't have a stable memory config for it :( [12:21:56] why is OOM showing up though [12:22:14] the box still had 5.33GB in cached memory [12:22:26] (03CR) 10Giuseppe Lavagetto: "Please remove the system::role declaration. Otherwise, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [12:22:28] (03CR) 10Volans: "Already nicer, I think we can save another couple of level of indentation ;)" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376688 (owner: 10Muehlenhoff) [12:22:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline comment. 
Otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [12:24:34] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:24:44] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:24:44] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:24:54] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:24:54] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:25:04] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:25:06] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:25:14] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:25:14] RECOVERY - puppet last run on dumpsdata1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:25:24] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:25:24] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:25:25] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:25:34] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:25:44] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:25:45] RECOVERY - puppet last run on mw1234 is OK: 
OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:25:54] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:25:54] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:25:54] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:29:47] (03PS2) 10Mobrovac: Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) [12:30:09] (03CR) 10Mobrovac: Add the cp-jobqueue profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [12:34:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think we probably want it to become CRIT at some point." [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [12:35:48] (03CR) 10Alexandros Kosiaris: [C: 032] Add the cp-jobqueue profile [puppet] - 10https://gerrit.wikimedia.org/r/376707 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [12:36:05] (03PS2) 10Alexandros Kosiaris: kubernetes: Add a few recommended admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374974 (https://phabricator.wikimedia.org/T170119) [12:36:12] <_joe_> !log reduced raid resync max speed of raid devices on conf200* to 20000 [12:36:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Add a few recommended admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374974 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [12:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:29] (03PS2) 10Alexandros Kosiaris: Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 
(https://phabricator.wikimedia.org/T170111) [12:36:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 (https://phabricator.wikimedia.org/T170111) (owner: 10Alexandros Kosiaris) [12:36:52] <_joe_> !log restarting a check of the md2 devices on conf200* T162013 after reducing sync speed [12:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:04] T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013 [12:40:08] (03PS1) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [12:40:47] (03CR) 10jerkins-bot: [V: 04-1] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [12:52:16] 10Operations, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591347 (10Joe) Reducing the sync speed manually did the job, so we can just puppetize this. [12:57:44] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cp-jobqueue] [12:57:55] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cp-jobqueue] [12:59:45] mobrovac: ^^^ [13:00:03] damn [13:00:50] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3591354 (10faidon) [13:04:01] volans: i can't do/see anything on tin (no perms). can you take a look at what does the puppet log say there? 
[13:05:15] mobrovac: exec of '/usr/bin/git -c core.sharedRepository=group clone --recurse-submodules https://gerrit.wikimedia.org/r/p/mediawiki/services/cp-jobqueue.git /srv/deployment/cp-jobqueue/cp-jobqueue' returned 128 [13:05:32] Operations, Mail: Split MXes into inbound and outbound - https://phabricator.wikimedia.org/T175362#3591376 (faidon) [13:05:45] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:06:19] volans: uf ok, so puppet first creates the dir and then tries to clone there, which fails because the dir exists [13:06:31] scap seems to assume that a repo always has a slash in it [13:06:33] lol [13:06:34] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [13:06:44] PROBLEM - HHVM rendering on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:06:53] _joe_: ^^^ [13:06:54] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 80330 bytes in 0.233 second response time [13:07:07] or moritzm ^^^ [13:07:21] _joe_: cp-jobqueue will not do it, we'll have to have a separate deploy repo for that; how about change-propagation/jobqueue-deploy? 
[13:07:24] <_joe_> looking [13:07:32] <_joe_> mobrovac: later, sorry [13:07:34] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.128 second response time [13:07:44] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 80329 bytes in 0.116 second response time [13:09:01] <_joe_> volans: same issue we had other times [13:09:14] ack [13:14:45] (03PS1) 10Giuseppe Lavagetto: etcd: limit RAID resync speed if on linux software raid [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013) [13:15:42] _joe_: our abstractions are nice, aren't they [13:15:57] a fact, a sysctl definition [13:19:07] <_joe_> paravoid: indeed! [13:19:55] if the raid resync is affecting etcd's consensus, maybe it's too sensitive? [13:20:14] <_joe_> yeah but the servers got to have ~ 20% iowait [13:20:49] (03PS1) 10Muehlenhoff: Add a new debdeploy command query_version [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 [13:20:52] <_joe_> and I still have to understand why that doesn't happen in eqiad [13:21:23] <_joe_> paravoid: I'm more comfortable reducing the raid sync speed than with fiddling with RAFT timeouts on a friday afternoon, too :) [13:21:46] nod [13:22:29] (03PS2) 10Muehlenhoff: Add a new debdeploy command query_version [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 [13:28:02] (03CR) 10Volans: [C: 031] "LGTM, compiler for reference available here:" [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [13:28:39] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: limit RAID resync speed if on linux software raid [puppet] - 10https://gerrit.wikimedia.org/r/376712 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [13:29:07] (03CR) 10Gehel: "LGTM in general. I'd like to find someone who has some time to check the state of maps codfw a bit more before merging this." 
[puppet] - https://gerrit.wikimedia.org/r/345591 (owner: BBlack) [13:34:54] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:35:13] <_joe_> uhm [13:35:40] <_joe_> mobrovac: it worked ^^ [13:35:46] <_joe_> "eventual consistency" :P [13:39:04] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:39:53] Operations, fundraising-tech-ops, procurement: cost estimate for two prometheus+grafana servers for fundraising - https://phabricator.wikimedia.org/T175364#3591415 (Jgreen) [13:40:22] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591441 (Joe) Open>Resolved a:Joe [13:51:44] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:57:33] (CR) Muehlenhoff: Add ferm service for rpc.statd on labstore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) (owner: Muehlenhoff) [14:00:04] (PS2) Muehlenhoff: Add ferm service for rpc.statd on labstore [puppet] - https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) [14:02:15] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (mark) @Dzahn: thanks for working on this! The current implementation with WARNINGs and CRITICAL at higher threshold seems like a good start. If needed, we can readjus... 
[14:06:35] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:08:01] 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591573 (10Suhadakashter) [14:08:57] 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (10Suhadakashter) [14:09:28] 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591614 (10Reedy) 05duplicate>03Open [14:11:43] (03PS2) 10Ema: prometheus: add aggregation rules for IPVS [puppet] - 10https://gerrit.wikimedia.org/r/376665 [14:12:02] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591637 (10fgiunchedi) I checked with @mark and the current readings are per-phase, since we're using "3 Wye" phase configuration and a server will consume from... [14:14:54] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591653 (10fgiunchedi) The graphs for codfw and eqiad are also reported here (stacked) https://grafana.wikimedia.org/dashboard/db/site-power-usage [14:15:13] 10Operations: Integrate stretch 9.1 point release - https://phabricator.wikimedia.org/T171453#3591654 (10MoritzMuehlenhoff) These are fully rolled out: nagios-nrpe gnutls28 perl [14:15:39] (03Abandoned) 10Andrew Bogott: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:16:26] (03Restored) 10Andrew Bogott: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:17:16] (03CR) 10Andrew Bogott: "Chase, did you mean to -1 https://gerrit.wikimedia.org/r/#/c/375941/ 
which actually is redundant?" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:20:14] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:20:42] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, after the merge the metric will start to be generated. The dashboards can be adjusted in a week or so when the new metrics have some" [puppet] - 10https://gerrit.wikimedia.org/r/376665 (owner: 10Ema) [14:22:58] (03CR) 10Rush: "Yep!" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:23:17] (03CR) 10Rush: [C: 04-1] "Already side merged this change :D" [puppet] - 10https://gerrit.wikimedia.org/r/375941 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:25:35] (03CR) 10Hashar: [C: 031] "Oops I forgot to report back on this change. I did the upgrade on 8/31:" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [14:25:44] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:26:34] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[systemd-timesyncd] [14:26:45] (03PS2) 10Rush: nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:27:17] (03CR) 10Hashar: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/374999 (owner: 10Hashar) [14:27:41] (03PS1) 10Mobrovac: ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281) [14:33:34] (03PS1) 10Alexandros Kosiaris: role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600) [14:34:21] (03PS1) 10Hashar: nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) [14:34:42] (03Abandoned) 10Andrew Bogott: nova: make default 'nova' availability-zone explicit [puppet] - 10https://gerrit.wikimedia.org/r/375941 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:34:51] (03CR) 10Andrew Bogott: [C: 032] nodepool: specify 'nova' availability zone [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [14:35:06] (03CR) 10Hashar: "This week I have migrated a lot of php 5.5 jobs from trusty to jessie :] So we need less trusty instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [14:39:16] (03PS2) 10Andrew Bogott: nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [14:40:49] (03CR) 10Andrew Bogott: [C: 032] nodepool: reduce trusty pool by one [puppet] - 10https://gerrit.wikimedia.org/r/376727 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [14:42:50] (03PS2) 10Alexandros Kosiaris: ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [14:42:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ChangeProp-JobQueue: Fix the repo location and name [puppet] - 10https://gerrit.wikimedia.org/r/376724 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [14:50:22] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3591719 (10mobrovac) [14:52:45] (03PS2) 10Alexandros Kosiaris: role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600) [14:52:47] (03PS1) 10Alexandros Kosiaris: dnsrecursor: Enable PowerdnsRecursor diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/376732 (https://phabricator.wikimedia.org/T169600) [14:55:02] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3591738 (10mobrovac) The repo has been set up and cloned on `tin` and the `ops/puppet` profile created and merged. Left to do is to... 
[14:55:05] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:56:39] (03CR) 10Muehlenhoff: [C: 031] [WMF] jessie: tweak build dependencies [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - 10https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [14:56:54] (03CR) 10Muehlenhoff: [V: 032 C: 032] [WMF] jessie: tweak build dependencies [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - 10https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [14:57:24] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[cpjobqueue/deploy] [14:58:08] !log testing wmf-auto-reimage also on mc1002 T166300 T164341 [14:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] T166300: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300 [14:58:24] T164341: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341 [15:01:24] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:02:25] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4021.* [15:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:13] (03PS1) 10Alexandros Kosiaris: Force the ssh key to be used by scap [software/librenms] - 10https://gerrit.wikimedia.org/r/376734 (https://phabricator.wikimedia.org/T172333) [15:04:39] (03CR) 10Alexandros Kosiaris: [C: 032] role::dnsrecursor: Partially transform into a profile [puppet] - 10https://gerrit.wikimedia.org/r/376726 (https://phabricator.wikimedia.org/T169600) (owner: 10Alexandros Kosiaris) [15:04:41] (03CR) 10Alexandros Kosiaris: [C: 032] dnsrecursor: Enable PowerdnsRecursor diamond 
collector [puppet] - 10https://gerrit.wikimedia.org/r/376732 (https://phabricator.wikimedia.org/T169600) (owner: 10Alexandros Kosiaris) [15:04:50] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [15:05:09] ah [15:05:12] _joe_: still testing? [15:05:13] <_joe_> uhm [15:05:14] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [15:05:17] _joe_: is that you? [15:05:23] <_joe_> volans: actually yes, the resync is ongoing [15:05:28] a ok [15:05:30] <_joe_> but this wasn't expected tbh [15:05:44] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:05:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Force the ssh key to be used by scap [software/librenms] - 10https://gerrit.wikimedia.org/r/376734 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [15:06:55] !log akosiaris@tin Started deploy [librenms/librenms@5554213]: Testing new scap keyholder_key config [15:06:59] !log akosiaris@tin Finished deploy [librenms/librenms@5554213]: Testing new scap keyholder_key config (duration: 00m 04s) [15:07:05] <_joe_> it already recovered btw [15:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:14] <_joe_> uhm, nope [15:07:15] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:27] <_joe_> ok, time to stop the raid resync again [15:07:44] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [15:08:00] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [15:08:39] btw, the fact that a RAID resync would 
be killing the etcd replication is a "who would have thought" moment [15:09:35] heheh [15:14:05] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591796 (10RobH) So Chris flipped around SDA and SDB but the installer still gives an error for SDA not being present. This is likely... [15:14:14] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591799 (10RobH) [15:15:29] (03PS1) 10Alexandros Kosiaris: Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) [15:16:12] (03CR) 10Alexandros Kosiaris: "Not urgent, but it would be nice to revert it at some point. Let me know what you think." [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [15:21:19] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591837 (10Johan) A couple of discussions: [[ https://en.w... [15:21:21] 10Operations, 10Patch-For-Review, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591838 (10Joe) I was too optimistic, it appears, in declaring victory. The new resync at reduced speeds still triggered consensus issues. It seems the version of etcd we'... 
[15:21:27] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3591839 (Joe) Resolved>Open [15:35:05] (PS3) Andrew Bogott: labmon: prometheus classes to monitor the keystone api endpoint [puppet] - https://gerrit.wikimedia.org/r/375452 [15:35:12] godog: lmk when you have a moment? I have more prometheus questions (or, really, the same prometheus questions) [15:36:34] (PS1) Hashar: zuul: allow email connection [puppet] - https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) [15:51:10] Operations, Patch-For-Review, Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3591910 (akosiaris) [15:51:15] Operations, Diamond, Traffic, monitoring, and 2 others: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3591908 (akosiaris) Open>Resolved Patches created and merged. Got a basic dashboard working at https://grafana.wikimedia.org/dashbo... [15:53:34] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4027.* [15:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:19] Operations, ops-eqdfw, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591915 (RobH) Chris also reseated it all, still error. [16:00:44] Operations, monitoring: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3527972 (akosiaris) >>! In T173427#3553083, @Volans wrote: > An alternative option could be to make this check passive, with a freshness threshold of like 35m, with the data pushed directly by the run-pup... 
[16:03:02] !log demon@tin Synchronized php-1.30.0-wmf.17/includes/revisiondelete/RevDelLogList.php: typofix (duration: 00m 46s) [16:03:14] (03PS1) 10Chad: Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/AbuseFilter/includes/Views/AbuseFilterViewExamine.php: T175338, followup (duration: 00m 46s) [16:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] T175338: Fatal: Wikimedia\Rdbms\DBQueryError on Special:AbuseFilter/examine: "Error: 1054 Unknown column 'rev_id' in 'field list' (10.64.16.191)" - https://phabricator.wikimedia.org/T175338 [16:06:04] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591939 (10BBlack) Thanks! So far, I haven't heard of any... [16:06:53] (03CR) 10Dzahn: [C: 031] "sounds reasonable to revert it since that was just a workaround for the scap issue which sounds resolved" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [16:13:03] (03CR) 10Dzahn: "oh, thanks for pointing that out. 
amending" [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [16:15:52] (03PS2) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) [16:16:41] (03PS3) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) [16:18:29] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3591956 (10Cmjohnson) [16:19:56] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10Cmjohnson) bios is setup, raid is configured to raid 10. switch ports need setup still 1019 -> b4 ge-4/0/33 1020 -> b7 ge-7/0/13 [16:20:44] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3591959 (10Cmjohnson) @MoritzMuehlenhoff is it okay to resolve this task? [16:21:04] (03CR) 10Dzahn: [C: 032] "this doesn't give actual access yet - adding to the right group has to be separate anyways for technical reasons and needs to wait until M" [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [16:28:44] (03PS1) 10Dzahn: admins: add rho to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204) [16:30:11] andrewbogott: heh, I have to run now, we can do next week tho! 
[16:30:16] monday that is [16:30:27] ok [16:30:35] (03CR) 10Dzahn: "not yet, but can be done on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [16:34:22] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3591976 (10Dzahn) @Rho The 2 changes above are prepared now but not merged yet. They will create your user account (step 1) and then add it to th... [16:37:28] (03CR) 10Dzahn: [C: 04-1] "yea, actually i think that too, we can just set the threshold really high. i just wanted to offer all the options since there was some dis" [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [16:37:43] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591981 (10Johan) To be honest, most of the community is b... [16:38:20] (03PS4) 10Dzahn: contint: upgrade git on zuul mergers [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [16:39:34] (03CR) 10Dzahn: [C: 032] "don't worry, it's not NOW, the upgrade has already happened in the past, see comment from Hashar" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [16:41:57] (03CR) 10Dzahn: "Info: Applying configuration version '1504888877'" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [16:44:05] (03CR) 10Dzahn: "git is already the newest version." 
[puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [16:52:38] (03PS1) 10RobH: dsaez uid update [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220) [16:53:46] (03CR) 10Dzahn: [C: 031] "renaming diego to dsaez seems right and should fix consistency warning and SWAP access" [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220) (owner: 10RobH) [16:53:52] (03PS1) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 [16:54:21] (03CR) 10jerkins-bot: [V: 04-1] restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (owner: 10ArielGlenn) [16:57:03] (03CR) 10RobH: [C: 032] dsaez uid update [puppet] - 10https://gerrit.wikimedia.org/r/376749 (https://phabricator.wikimedia.org/T175220) (owner: 10RobH) [16:57:45] (03PS1) 10BBlack: [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 [17:02:44] (03CR) 10Muehlenhoff: [C: 031] "Sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [17:02:53] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3592052 (10Papaul) Hi Papaul, Here is the dispatch information for service to address system unresponsiveness of the PowerEdge R430 with service tag FXLPND2. Next business day service dispatch Service date d... 
[17:04:17] (03PS2) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 [17:04:53] (03CR) 10Muehlenhoff: [C: 031] admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [17:06:12] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez] [17:07:06] ^ that's gonna be related to the user renaming [17:07:08] and will be temp [17:07:26] robh is on it [17:07:36] yeah, he is logged into that [17:07:42] and i need to tell him to kill his screen sessions [17:07:54] heh, screen sessions [17:08:33] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez] [17:09:52] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez] [17:10:13] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez] [17:11:53] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:12:42] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:14:12] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): User[dsaez] [17:16:22] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:19:29] 10Operations, 10Puppet, 10Readers-Web-Backlog (Tracking): Remove references to non-existent mfLazyLoadReferences cookies - https://phabricator.wikimedia.org/T175381#3592074 (10Jdlrobson) [17:19:48] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3592094 (10RobH) [17:19:50] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3592092 (10RobH) 05Open>03Resolved This change caused a wholly expected error, namely the new user had their uid used by the old us... [17:19:59] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 (10RobH) 05Open>03Resolved a:03RobH Fixed! [17:20:12] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:21:54] (03PS4) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) [17:24:34] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:28:53] (03CR) 10Dzahn: [C: 031] "let me know if today is good" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [17:31:03] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dsaez] [17:32:03] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): User[dsaez] [17:35:12] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:38:42] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:52] PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106) [17:40:12] RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:40:32] RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [17:45:23] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:48:38] (03CR) 10Thcipriani: [C: 031] Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [17:51:08] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3592267 (10madhuvishy) 05Open>03Resolved @fgiunchedi Thank you! That seems to have fixed it. Resolving this task. Thanks everyone :) [17:53:29] (03PS1) 10Chad: Drop PrivateSettings symlink, just include directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376762 [17:54:06] (03PS10) 10Chad: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [17:57:50] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [18:01:05] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3592304 (10EBernhardson) cirrusSearchCheckerJob - basically idempotent. It verifies data in elasticsearch matches mysql, creates new... 
[18:10:43] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3592325 (10RHo) Awesome, thanks @Dzahn ! [18:12:01] (03PS1) 10Chad: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 [18:12:03] (03CR) 10Chad: [C: 032] Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad) [18:12:56] (03PS1) 10Dzahn: admins: add Kai Nissen (knissen) to LDAP (nda) users [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046) [18:13:52] (03Merged) 10jenkins-bot: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad) [18:16:11] (03CR) 10jenkins-bot: Fix pedantic spacing in phpcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376767 (owner: 10Chad) [18:17:08] (03CR) 10RobH: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046) (owner: 10Dzahn) [18:17:31] !log demon@tin Synchronized wmf-config/FeaturedFeedsWMF.php: no-op (duration: 00m 46s) [18:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:50] (03CR) 10Thcipriani: [C: 031] Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad) [18:18:56] (03CR) 10Dzahn: [C: 032] admins: add Kai Nissen (knissen) to LDAP (nda) users [puppet] - 10https://gerrit.wikimedia.org/r/376769 (https://phabricator.wikimedia.org/T168046) (owner: 10Dzahn) [18:30:02] mutante: Guten Tag. I got some notification from you but the irc client froze eventually [18:30:52] hashar: i was just merging the "upgrade git on zuul mergers change" and saying (to others watching it) to not worry - because it looked like an upgrade now.. but actually it already happened in the past [18:31:10] so i just said "dont worry.. etc.. 
see comment by hashar above" [18:31:30] mutante: yeah sorry I forgot to poke the Gerrit task when I did the upgrade manually [18:31:33] and then i confirmed the change on contint1001, it added the backports config [18:31:44] and apparently nothing happened since I did the change. So Iguess it is fine [18:31:44] but also the git package was already upgraded.. no-op [18:31:51] \o/ [18:32:07] yep :) [19:04:55] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3592485 (10aaron) >>! In T171071#3464513, @Marostegui wrote: > Hey, > > I would be nice to do a test with MariaDB 10.0 and 1... [19:07:43] (03PS2) 10Herron: WIP: icinga: add check_sysctl.sh script [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060) [19:25:24] (03PS1) 10Dzahn: site: remove unused virtual host 'zosma' [puppet] - 10https://gerrit.wikimedia.org/r/376779 (https://phabricator.wikimedia.org/T138650) [19:27:45] (03PS1) 10Dzahn: remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650) [19:37:47] !log removing ganeti instance 'zosma' on ganeti2001 (T138650) [19:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:00] T138650: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650 [19:38:45] (03CR) 10Dzahn: [C: 032] "removed on ganeti2001 - nobody except ops ever logged in here (there were no access groups)" [puppet] - 10https://gerrit.wikimedia.org/r/376779 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [19:39:23] PROBLEM - Host zosma is DOWN: PING CRITICAL - Packet loss = 100% [19:40:33] ACKNOWLEDGEMENT - Host zosma is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T138650 [19:43:06] !log zosma.codfw.wmnet - delete salt key, puppet node clean, puppet 
node deactivate, remove from Icinga,... (T138650) [19:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:18] T138650: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650 [19:49:33] (03PS2) 10Dzahn: remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650) [19:50:03] (03CR) 10Dzahn: [C: 032] remove unused VM 'zosma' [dns] - 10https://gerrit.wikimedia.org/r/376780 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [19:52:04] 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3592768 (10Dzahn) @EddieGP Thanks, yea. I removed it. Should be all done now, also DNS. [19:52:23] 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3592771 (10Dzahn) [19:52:27] 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3592770 (10Dzahn) 05Open>03Resolved [19:52:41] 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10Dzahn) [19:52:43] 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2406322 (10Dzahn) 05Resolved>03declined [20:03:20] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3592820 (10Dzahn) 05stalled>03declined [20:06:37] (03CR) 10Chad: [C: 032] Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad) [20:08:19] (03Merged) 10jenkins-bot: Remove git_repo config for now [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad) [20:08:29] (03CR) 10jenkins-bot: Remove git_repo config for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376743 (owner: 10Chad) [20:10:02] !log demon@tin Synchronized scap/scap.cfg: drop git_repo for now, T175041 (duration: 00m 46s) [20:10:11] !log heze (bacula storage) - installing BIOS upgrade (T162850) [20:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:17] T175041: scap sync failed on i18n - https://phabricator.wikimedia.org/T175041 [20:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:29] T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 [20:14:19] !log heze - rebooting for firmware upgrade [20:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:20] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3592861 (10Dzahn) [20:24:39] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/Flow/includes/: bugs and stuff (duration: 00m 54s) [20:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:30] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2047458 [20:30:17] no_justification: Adding or removing bugs? [20:30:19] 10Operations, 10monitoring: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3592881 (10herron) >! In T173427#3591926, @akosiaris wrote: > > It's about avoiding transient network failure positives and making sure the failure is not a transient one. This, but about puppet runs, is w... [20:30:44] Reedy: Well, I wasn't undeploying Flow so I guess adding? [20:30:47] ;-) [20:33:13] needs to reboot all 3 logstash servers.. but .. eh.. not now [20:35:11] mutante: Huh? 
[20:36:10] BIOS upgrade [20:36:23] (03PS1) 10MaxSem: Leave a comment that ACW must be loaded before VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376791 [20:36:29] because they are Dell R320 and there was that weird CPU throttling on those sometimes [20:36:48] i suppose i could depool one of them at a time, reboot, repool,, wait .. next one etc [20:37:05] but also .., stuff like https://wikitech.wikimedia.org/wiki/Service_restarts#Logstash [20:37:09] ? [20:37:34] this is just 1001-1003 though, not 1004-1006 [20:38:48] yea, wasnt really related to the gerrit thing, but somebody who knows logstash well would be nice for both, heh [20:39:18] mutante: you want guillaume ;) [20:39:30] 1004-6 are the new ones, not relly sure of the migration state [20:39:36] ah! [20:39:50] thanks volans, ok [20:40:43] mutante: actually I might be confusing the numbers? T175045 [20:40:43] T175045: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045 [20:40:50] now I'm lost :D [20:41:50] oops, eh yea :) [20:41:56] i'll ask [20:46:02] Wait, there's a new BIOS version to fix the CPU issue on R320s? [20:46:56] (03PS1) 10MaxSem: Add logging for email blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376792 (https://phabricator.wikimedia.org/T175419) [20:48:19] no_justification: yea, well.. we are installing 2.4.2 over 1.2.4 and havent seen one so far.. 
hope [20:48:45] maybe :) [20:49:36] (03CR) 10Kaldari: [C: 031] Leave a comment that ACW must be loaded before VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376791 (owner: 10MaxSem) [20:50:30] (03PS11) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [21:04:07] mutante: Tbf, we never managed to isolate which R320s did it anyway, so silence != fixed (sadly) [21:07:21] no_justification: we did to a certain extent https://phabricator.wikimedia.org/T162850#3179492 [21:07:42] well the ones that got tickets are linked [21:08:24] R320s with Jessie and a 4.x kernel. But I assume there's been others in that same set that haven't been affected? [21:08:25] * no_justification shrugs [21:08:42] there are not that many R320s [21:09:19] as opposed to other models [21:09:34] and only R320s have been affected. within their group the percentage is pretty high [21:09:46] (before upgrades) [21:10:10] * no_justification nods [21:23:34] hi, does anyone know where does logstash store things by default if you have not configured a dashboard for it [21:23:46] cc mutante ^^ [21:24:35] paladox: what are you looking for ? maybe i can help you find it (i assume logstash-beta aka the cloud services instance of logstash?) [21:25:14] Zppix it's logstash, /me and mutante are testing logstash for gerrit. But before anything is done in gerrit's config we are testing through the command line to see if it works [21:25:24] logstash prod [21:25:44] oh nevermind i cannot (prod logstash is outside of my access level) [21:26:17] do you have logstash not-prod? 
[21:26:36] nope [21:27:56] mutante: i can only access logstash-beta lol [21:28:21] well, that's not-prod :) [21:29:47] so yes [21:31:12] Zppix: it's about making a nice dashboard i guess [21:32:10] i dont know much about logstash other than to look at it to see how bad i screwed something up xD [21:32:38] ;) [21:43:01] PROBLEM - Host mc1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:10] this is me ^^^ [21:43:21] RECOVERY - Host mc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [21:49:11] volans: you should be off enjoying the weekend :) [21:49:51] chasemp: you got me! [21:54:42] !log gerrit2001 - restarting gerrit to test logstash config change (while not touching "live" gerrit) [21:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:22] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3593074 (10Pchelolo) [21:57:59] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10Pchelolo) Thank you @EBernhardson, updated the task with your info. Now we've got a complete list of jobs executed in pro... 
[22:19:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:20:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:22:13] (03Draft1) 10Paladox: logstash: Enable ferm port 5229 and enable tcp input [puppet] - 10https://gerrit.wikimedia.org/r/376836 [22:22:16] (03PS2) 10Paladox: logstash: Enable ferm port 5229 and enable tcp input [puppet] - 10https://gerrit.wikimedia.org/r/376836 [22:29:06] (03PS1) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [22:29:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:29:52] (03PS2) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [22:29:54] (03PS2) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [22:30:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:30:33] (03CR) 10jerkins-bot: [V: 04-1] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:33:34] (03Abandoned) 10Paladox: logstash: Enable ferm port 5229 and enable tcp input [puppet] - 10https://gerrit.wikimedia.org/r/376836 (owner: 10Paladox) [22:34:35] (03PS3) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [22:34:37] (03PS3) 10Rush: openstack: 
designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [22:35:09] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:35:14] (03CR) 10jerkins-bot: [V: 04-1] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:37:20] (03PS4) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [22:37:22] (03PS4) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [22:37:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [23:09:20] !log phab1001, phab2001 - terminating 3 unused screen processes that were from iridium migration / data sync [23:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:27:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:32:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:38:12] !log netmon1002 - terminated unused screen session [23:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:55] (03PS1) 10EBernhardson: Fix human search relvance survey sampling rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376852 [23:42:29] (03CR) 10EBernhardson: [C: 032] 
Fix human search relvance survey sampling rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376852 (owner: 10EBernhardson) [23:44:06] (03Merged) 10jenkins-bot: Fix human search relvance survey sampling rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376852 (owner: 10EBernhardson) [23:46:09] (03CR) 10jenkins-bot: Fix human search relvance survey sampling rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376852 (owner: 10EBernhardson) [23:46:22] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-rel-survey.php: T171740: Fix inverted sampling rates for human relevance survey (duration: 00m 47s) [23:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:36] T171740: [Epic] Search Relevance: graded by humans - https://phabricator.wikimedia.org/T171740