[00:01:32] (PS1) Alex Monk: Add account creation throttle exception for event before next window [mediawiki-config] - https://gerrit.wikimedia.org/r/333468 (https://phabricator.wikimedia.org/T155877)
[00:01:36] Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2958647 (scfc) See also T153163, subtasks and associated commits (plus two or three that I have yet to publish, polishing code isn't fun :-)).
[00:01:53] Puppet, Labs, Tool-Labs, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2958653 (scfc)
[00:01:55] Puppet, Labs, Tool-Labs, Patch-For-Review: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2958652 (scfc) Open>Resolved
[00:05:05] (CR) Alex Monk: [C: 2] Add account creation throttle exception for event before next window [mediawiki-config] - https://gerrit.wikimedia.org/r/333468 (https://phabricator.wikimedia.org/T155877) (owner: Alex Monk)
[00:06:36] (Merged) jenkins-bot: Add account creation throttle exception for event before next window [mediawiki-config] - https://gerrit.wikimedia.org/r/333468 (https://phabricator.wikimedia.org/T155877) (owner: Alex Monk)
[00:07:24] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:07:59] (CR) jenkins-bot: Add account creation throttle exception for event before next window [mediawiki-config] - https://gerrit.wikimedia.org/r/333468 (https://phabricator.wikimedia.org/T155877) (owner: Alex Monk)
[00:08:23] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/333468 - needed to be handled before next window (duration: 00m 42s)
[00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:27] bd808, ^
[00:08:52] thanks Krenair
[00:10:19] They do a lot of workshops these days.
[00:26:11] (CR) Alex Monk: Tools: Use exported resources for ssh host keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: Tim Landscheidt)
[00:30:51] Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2958672 (Smalyshev) @Ricordisamoa For Wikidata Items, yes, you can do that.
[00:32:40] (PS1) Alex Monk: Move postgres tuning.conf into module [puppet] - https://gerrit.wikimedia.org/r/333470
[00:37:56] (PS1) Alex Monk: Allow use of PuppetDB in labs for sshknowngen [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792)
[00:40:36] (PS1) Alex Monk: ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792)
[00:42:36] (PS5) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[00:42:54] (CR) Paladox: ">" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[00:44:17] (PS6) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[00:45:04] (CR) Paladox: [C: 1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it (3 comments) [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[00:45:18] (CR) Paladox: "Tested sysvinit one and it worked." [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[00:46:14] (PS1) Alex Monk: puppetdb: Allow tuning.conf to have a different shared_buffers value [puppet] - https://gerrit.wikimedia.org/r/333473 (https://phabricator.wikimedia.org/T72792)
[00:49:08] (PS2) Alex Monk: ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792)
[01:05:03] (Draft1) Paladox: Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/333475
[01:05:07] (PS2) Paladox: Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/333475
[01:05:59] (CR) Paladox: "@Chad Should i do this through puppet or do it here?" [debs/gerrit] - https://gerrit.wikimedia.org/r/333475 (owner: Paladox)
[01:07:21] (PS7) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[01:14:25] (CR) Paladox: [C: 1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[01:18:01] (PS8) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[01:18:07] (CR) Paladox: [C: 1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[01:34:03] (Draft2) TTO: Temporarily set $wgDisableUserGroupExpiry to true on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/333476
[01:53:34] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.135 second response time
[02:01:56] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[02:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:34] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.104 second response time
[03:22:31] Operations, ops-eqiad, media-storage: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2958800 (Volans)
[03:23:14] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 732.86 seconds
[03:26:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.32 seconds
[03:31:49] Operations, ops-eqiad, media-storage: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2958424 (Volans) Also puppet is broken because of this, is this //expected// @fgiunchedi ? Shouldn't we detect the failed disk and let puppet continue running instead?
[03:31:49] ACKNOWLEDGEMENT - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 29 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1] Volans Broken disk: https://phabricator.wikimedia.org/T155907
[03:36:34] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:38:35] Operations, ops-eqiad: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2958806 (Volans) @Cmjohnson: I think that @ArielGlenn might be the right person to ask to ;)
[04:04:24] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[04:04:34] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:09:44] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[04:18:33] (PS6) Juniorsys: role analytics_cluster: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332106 (https://phabricator.wikimedia.org/T93645)
[04:21:39] (CR) Juniorsys: "@Elukey Good point - I think this has been fixed now. While comma separated include syntax is valid, it is rarely seen in the wild, which " [puppet] - https://gerrit.wikimedia.org/r/332106 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[04:30:38] (CR) Tim Landscheidt: [C: 1] "I did this way simpler in https://gerrit.wikimedia.org/r/#/c/329390/, relying on PostgreSQL's defaults." [puppet] - https://gerrit.wikimedia.org/r/333473 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[04:31:44] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:33:24] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:33:34] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:37:44] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[04:46:33] Operations, Puppet, Horizon, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (Krenair) It looks like what actually takes most of the time is rendering each of the 468 rows it has to display.
[05:00:44] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[05:01:34] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[05:16:05] Operations, Puppet, Horizon, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2958917 (Krenair) It seems we can bypass all of the django/horizon template stuff going on in each row render by implementing `render` on UpdateRow in puppet_tables.py, taking off a...
[06:07:14] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:30:24] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:36:14] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:39:24] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[06:46:18] (CR) Tim Landscheidt: [C: 1] "Didn't test, but LGTM." [puppet] - https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[07:00:51] (CR) Tim Landscheidt: [C: -1] Tools: Use exported resources for ssh host keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: Tim Landscheidt)
[07:19:23] (CR) Tim Landscheidt: [C: -1] "As mentioned in Iee22dab01af78f38103743644061e4387d254d12, IMHO this change should also add "if ! $::use_puppetdb { }" guards around the g" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[08:05:00] Operations, Puppet, Horizon, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2959003 (Joe) I would suggest, a few things: # check if the data is too large to fit on memcache (hint: they probably are). If that's the case, save them in smaller data structures....
[08:19:44] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:48:44] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:31:44] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[09:36:14] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:59:44] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:04:14] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[10:37:57] Operations, ChangeProp, Mobile-Content-Service, Parsing-Team, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2959060 (Joe) Open>Resolved
[10:38:00] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2959061 (Joe)
[10:38:39] Operations, Parsoid, User-Joe, User-mobrovac: Parsoid timing out or failing when trying to parse specific user page - https://phabricator.wikimedia.org/T155618#2948681 (Joe) Open>Resolved
[11:18:55] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[11:19:45] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[11:22:40] /wion,/w
[11:37:10] Operations, MediaWiki-Export-or-Import, Wikimedia-General-or-Unknown, MW-1.29-release-notes: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#2959089 (FilipGCI)
[11:45:35] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:14:35] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[12:15:02] (PS1) MarcoAurelio: Amend import sources for en.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/333487 (https://phabricator.wikimedia.org/T155922)
[12:24:10] (PS1) MarcoAurelio: Amend category collation for de.wikisource to 'uca-de-u-kn' [mediawiki-config] - https://gerrit.wikimedia.org/r/333488 (https://phabricator.wikimedia.org/T155916)
[13:03:19] (Draft1) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/333492
[13:03:36] (PS2) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/333492
[13:38:04] (Abandoned) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/333492 (owner: Paladox)
[13:53:55] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:13:25] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:22:55] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[14:27:25] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:42:25] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[14:56:25] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[15:05:25] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[15:33:25] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[15:39:55] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[16:06:55] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[16:10:08] p858snake: thats still open since its not done.
[16:10:21] so please leave open
[16:46:15] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:25] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:14:15] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[17:24:45] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:32:25] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[17:52:45] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[17:54:15] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:22:15] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[18:44:40] Operations, MediaWiki-extensions-CentralNotice, User-Dereckson: Create /community-beacon alternative entry point - https://phabricator.wikimedia.org/T155929#2959875 (Dereckson)
[19:58:25] Operations, OCG-General, Reading-Community-Engagement, Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2959924 (Moushira)
[21:34:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[21:35:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3082725 keys, up 83 days 13 hours - replication_delay is 0
[22:38:56] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:07:56] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:26:03] !log restbase deploying d1663345 - blacklist of a bot log page on enwiki
[23:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log