[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161209T0000). Please do the needful. [00:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:17] see my comment ^^^^ about SWAT [00:00:35] yurik: Has it already been merged to core too [00:00:35] ? [00:00:41] The submodule bump? [00:00:53] here [00:01:04] ostriches, it has only been merged to master and wmf5 for the extension [00:01:09] we didn't touch core [00:01:11] Needs to go into core too. [00:01:15] I tried to do the bump, but the result was https://gerrit.wikimedia.org/r/#/c/326060/1 [00:02:09] Um... [00:02:10] ok? [00:02:14] #diditwrong [00:02:15] :p [00:02:20] kaldari: Reviewing yours [00:02:30] (03PS2) 10Chad: Temporarily disable centralauth-rename right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: 10Kaldari) [00:03:35] ostriches so much easier reverting on stable-2.13, i am building it now [00:03:49] (03CR) 10Chad: [C: 032] Temporarily disable centralauth-rename right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: 10Kaldari) [00:04:08] MaxSem: PS2 looks ok [00:04:35] arr, I was still looking at ps1 [00:04:37] (03Merged) 10jenkins-bot: Temporarily disable centralauth-rename right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: 10Kaldari) [00:04:40] fucking gerrit [00:04:50] I mean it was in the URL ;p [00:05:55] yeah, about a month ago gerrit started remembering older PS number, instead of showing the latest [00:06:06] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: swat (duration: 00m 46s) [00:06:12] gerrit ppl are 
really outsmarting themselves with it ;( [00:06:14] kaldari: You're live ^ [00:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:30] yurik: Um, it's always done that when you're browsing a specific patch set # :) [00:06:54] ostriches, nah, i think it got introduced in the version that added editing beyond comments [00:06:54] ostriches: yay! [00:06:55] it should show orange now [00:07:04] to tell you you're on an old patch. [00:08:23] well, I'm colorblind. not that I can't see orange, but I learned to pay little attention to colors [00:09:31] It should show on the top right. [00:09:47] where you click to change patchsets (bar where you press reply) [00:16:31] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2858961 (10Deskana) >>! In T110171#2858787, @greg wrote: > It's an explicit follow-up from an incident. These should be... [00:19:44] ostriches, is swat done? can I sync https://gerrit.wikimedia.org/r/#/c/326051/ ? [00:20:20] Yeah was just 1 patch [00:20:27] thx [00:22:44] !log maxsem@tin Synchronized php-1.29.0-wmf.5/extensions/JsonConfig: https://gerrit.wikimedia.org/r/#/c/326051/ (duration: 00m 46s) [00:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:01] yurik, ^ [00:23:17] awesome, thanks!!! 
[00:23:36] 07Puppet: Inconsistent groups for Git repositories with role::puppetmaster::standalone - https://phabricator.wikimedia.org/T152060#2858993 (10scfc) p:05Triage>03Normal a:03scfc [00:31:23] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 660 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4750794 keys, up 38 days 16 hours - replication_delay is 660 [00:31:23] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4754387 keys, up 38 days 16 hours - replication_delay is 614 [00:31:33] PROBLEM - Redis status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 4756569 keys, up 38 days 15 hours - replication_delay is 609 [00:32:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4752415 keys, up 38 days 16 hours - replication_delay is 650 [00:33:03] PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9480054 keys, up 38 days 15 hours - replication_delay is 623 [00:37:33] RECOVERY - Redis status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 4743919 keys, up 38 days 15 hours - replication_delay is 0 [00:38:23] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4738087 keys, up 38 days 16 hours - replication_delay is 0 [00:38:23] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4741397 keys, up 38 days 16 hours - replication_delay is 0 [00:39:03] RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 
1 databases (db0) with 9445496 keys, up 38 days 15 hours - replication_delay is 0 [00:46:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4739597 keys, up 38 days 16 hours - replication_delay is 0 [00:47:33] ostriches i will try to build tomorrow as i keep getting gc errors [00:47:37] out of memory errors [00:47:51] i will try and see if i can build it on another host and scp it to there. [00:47:58] unless you want to do it. [00:50:12] paladox: No rush on a revert, we can wait a bit [00:50:16] Have a good evening [00:52:43] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:04:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:05:52] ok [01:05:54] thanks [01:06:00] ostriches lol it's morning [01:06:05] 01:05am [01:07:45] (03Abandoned) 10BryanDavis: logstash: dynamically rename object values [puppet] - 10https://gerrit.wikimedia.org/r/320441 (https://phabricator.wikimedia.org/T150106) (owner: 10BryanDavis) [01:08:21] (03PS3) 10BryanDavis: l10nupdate: aquire scap lock before changing files [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) [01:20:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 615 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4740542 keys, up 38 days 16 hours - replication_delay is 615 [01:20:43] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [01:32:13] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:32:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [01:45:19] ostriches i managed to copy the gerrit folder over to gerrit-test using apache [01:45:32] scp, ssh won't work for me, keeps saying something about permission denied. [01:53:31] ostriches im deploying it now [01:54:38] !log foreachwiki extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php [01:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:33] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4740160 keys, up 38 days 17 hours - replication_delay is 651 [01:55:33] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4743539 keys, up 38 days 17 hours - replication_delay is 648 [01:56:03] PROBLEM - Redis status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9447606 keys, up 38 days 17 hours - replication_delay is 648 [01:56:23] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4735028 keys, up 38 days 17 hours - replication_delay is 0 [01:56:23] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4738511 keys, up 38 days 17 hours - replication_delay is 0 [01:59:03] RECOVERY - Redis status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 9442179 keys, up 38 days 17 hours - replication_delay is 0 [01:59:37] ostriches it's started again with the reverted patch, could you please try logging in again [02:00:12] Oh nope [02:00:13] RECOVERY - puppet last run on mw1286 is 
OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [02:00:14] doesn't work [02:00:15] Cannot assign user name "paladox" to account 19; name already in use. [02:00:20] when doing Paladox [02:00:44] but when i had all to lower case i could use both [02:00:48] but you couldn't log in [02:01:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4736690 keys, up 38 days 17 hours - replication_delay is 0 [02:21:36] PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds [02:30:01] looks like only db1028 is affected on s7 [02:34:36] disk problem maybe? i don't have access atm [02:38:44] or perhaps a scheduled job, increased activity started at around 1 afaics from grafana [02:55:53] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:03:03] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [03:10:33] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.74 seconds [03:19:02] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2859207 (10Huji) A user at FA WP also just tested it with a Yahoo! sender address and it wor... [03:22:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 715.12 seconds [03:24:53] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [03:27:13] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:29:03] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [03:31:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 186.10 seconds [03:38:10] kaldari: is it possible populateLocalAndGlobalIds.php is spamming centralauth ? we got a slave lag page [03:38:48] also traffic increased on s7 globally https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s7&var-role=All&from=1481233111803&to=1481254711803 [03:38:55] godog: Maybe, I'll kill it for now.... [03:39:48] kaldari: ok thanks! I'll keep an eye on it and see if that was the cause [03:40:43] only db1028 suffered though, the other slaves didn't have a problem with it [03:41:01] godog: killed it [03:42:56] yeah written rows are dropping [03:44:48] godog: that's not surprising, all the script does is write a lot of rows, but it's supposed to waitForSlaves after each batch of 1000. [03:45:42] it did about 4 million writes this afternoon before I killed it [03:47:44] kaldari: ah, db1028 is weighted at 0 in mw I wonder if that's related [03:53:08] I wonder if db1028 will get a chance of catching up on the lag now [03:53:48] kaldari: anyways thanks for killing it, so far my best lead is db1028 being at weight 0 and waitforslave not waiting for it [03:54:08] looks like the lag is leveling off at least (rather than continuing to climb) [03:55:55] and now actually went down a little bit [03:56:57] godog: thanks for pinging me, looks like we might be on the way back to normal now. [03:57:29] I was looking at the slave lag when I first started running the script, but hadn't checked it since [03:57:33] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:58:22] kaldari: no worries, yeah it took about 30m to page I think [03:58:55] hmm, now it went back up again [03:59:42] godog: https://tendril.wikimedia.org/chart?hosts=db1028&vars=seconds_behind_master&mode=value [04:01:26] indeed, I'm comparing it with e.g. db1062 in grafana [04:01:44] the "write query stats" panel, which it dropped for db1062 but not db1028 [04:04:10] looks like it might be still flushing to disk, writing a lot [04:04:26] yeah, maybe it still has a lot to catch up on [04:05:33] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:06:53] there we go [04:07:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4739124 keys, up 38 days 19 hours - replication_delay is 647 [04:07:35] RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:08:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2202.80 Read Requests/Sec=2093.40 Write Requests/Sec=8.50 KBytes Read/Sec=20758.40 KBytes_Written/Sec=55.20 [04:08:29] I'll wait a bit to make sure db1028 is ok [04:09:02] kaldari: I'll file a task [04:09:43] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:13:41] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859216 (10fgiunchedi) [04:14:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4731976 keys, up 38 days 19 hours - replication_delay is 0 [04:17:18] kaldari: LGTM now, logging off, not yet sure about the root cause tho. I guess the script can be held off for now? 
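Note on the exchange above: kaldari describes the script as writing rows and waiting for the replicas after each batch of 1000, and godog's best lead (not confirmed as the root cause in this log) is that db1028, weighted at 0, was not among the hosts being waited for. A minimal Python sketch of that pattern follows. It is not MediaWiki's code: `Replica`, `wait_for_replicas`, `run_batches`, the 5-second lag limit and the lag probes are all hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Replica:
    name: str
    weight: int                        # load weight; 0 means no normal read traffic
    lag_seconds: Callable[[], float]   # hypothetical probe for current replication lag


def wait_for_replicas(replicas: List[Replica], max_lag: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      max_polls: int = 600) -> List[str]:
    """Poll until every replica with non-zero weight is within max_lag.

    Replicas with weight 0 are ignored here, which is the assumption under
    discussion in the log: such a host can fall far behind unnoticed.
    Returns the names of the replicas that were actually waited on.
    """
    watched = [r for r in replicas if r.weight > 0]
    for _ in range(max_polls):
        if all(r.lag_seconds() <= max_lag for r in watched):
            return [r.name for r in watched]
        sleep(1.0)
    raise TimeoutError("replicas did not catch up in time")


def run_batches(rows: Iterable, write_batch: Callable[[list], None],
                replicas: List[Replica], batch_size: int = 1000) -> int:
    """Write rows in batches, waiting for replication after each batch."""
    written = 0
    batch: list = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            write_batch(batch)
            written += len(batch)
            batch = []
            wait_for_replicas(replicas)
    if batch:  # flush the final partial batch
        write_batch(batch)
        written += len(batch)
        wait_for_replicas(replicas)
    return written
```

With one healthy weighted replica and one lagging replica at weight 0, the wait returns immediately and the writer keeps going at full speed, which would match a single host accumulating lag while the others look fine.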
[04:18:13] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=24.38 Read Requests/Sec=0.40 Write Requests/Sec=4.10 KBytes Read/Sec=6.40 KBytes_Written/Sec=38.00 [04:25:33] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:39:43] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [04:48:43] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:33] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 76 failures. Last run 2 minutes ago with 76 failures. Failed resources (up to 3 shown): Package[nagios-plugins-basic],Package[apt-transport-https],Package[tree],Package[ngrep] [05:16:43] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:24:27] 06Operations, 06Discovery, 06Maps (Tilerator): Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2859266 (10Yurik) [05:30:22] 06Operations, 03Interactive-Sprint, 06Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2859274 (10Yurik) [05:30:33] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:09:53] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:12:43] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:23:23] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:28:43] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:30:33] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jq] [06:38:53] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:40:43] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:54:23] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:56:43] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:57:33] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:58:33] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:12:43] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:22:13] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:26:33] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:31:15] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859340 (10Marostegui) Thanks guys for taking care of this. A quick HW check reveals no issue with db1028, just to discard issues.... 
[07:39:28] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859343 (10Peachey88) [07:39:53] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:41:43] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:42:34] !log Stop MySQL db2034 and db2048 for maintenance - T149553 [07:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:45] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [08:00:59] (03PS1) 10Yuvipanda: labsdb: Fixup maintain-dbusers.sql [puppet] - 10https://gerrit.wikimedia.org/r/326076 [08:01:12] (03PS2) 10Yuvipanda: labsdb: Fixup maintain-dbusers.sql [puppet] - 10https://gerrit.wikimedia.org/r/326076 [08:01:15] (03CR) 10jenkins-bot: [V: 04-1] labsdb: Fixup maintain-dbusers.sql [puppet] - 10https://gerrit.wikimedia.org/r/326076 (owner: 10Yuvipanda) [08:02:50] (03CR) 10Yuvipanda: [C: 032] labsdb: Fixup maintain-dbusers.sql [puppet] - 10https://gerrit.wikimedia.org/r/326076 (owner: 10Yuvipanda) [08:08:53] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:21:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644) [08:26:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [08:26:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326077 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [08:29:44] !log marostegui@tin 
Synchronized wmf-config/db-eqiad.php: Depool db1082 - T150644 (duration: 02m 10s) [08:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:55] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [08:39:12] !log Deploy alter table S5 wikidatawiki.revision on db1082 - T150644 [08:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:24] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [08:41:04] I'm going to be afk for a little while, I have an errand to run [08:45:10] ACKNOWLEDGEMENT - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 436249921 for key PRIMARY on query. Default database: commonswiki. Query: [snipped]2 Marostegui T152766 [09:07:41] (03PS1) 10Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) [09:10:40] (03PS2) 10Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) [09:11:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [09:16:53] (03PS3) 10Marostegui: mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) [09:18:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:25:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:26:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [09:29:32] 06Operations, 06Labs: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767#2859419 (10Volans) [09:30:51] (03PS1) 10Volans: Add missing Hiera for labspuppetbackend_mysql_password [labs/private] - 10https://gerrit.wikimedia.org/r/326082 (https://phabricator.wikimedia.org/T152767) [09:33:15] (03CR) 10Volans: [V: 032 C: 032] Add missing Hiera for labspuppetbackend_mysql_password [labs/private] - 10https://gerrit.wikimedia.org/r/326082 (https://phabricator.wikimedia.org/T152767) (owner: 10Volans) [09:40:44] (03PS1) 10Elukey: Remove the role eventlogging from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/326083 (https://phabricator.wikimedia.org/T152621) [09:41:51] (03CR) 10Marostegui: "All look good: https://puppet-compiler.wmflabs.org/4845/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:43:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:51] (03CR) 10Marostegui: [C: 032] mariadb: Added calculation for gtid_domain_id [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/326080 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:44:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [09:44:49] (03CR) 10Elukey: [C: 032] Remove the role eventlogging from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/326083 (https://phabricator.wikimedia.org/T152621) (owner: 10Elukey) [09:47:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:33] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:47:53] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:48:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [09:48:57] 06Operations, 06Labs, 13Patch-For-Review: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767#2859467 (10Volans) p:05High>03Normal a:05Volans>03None I've quickly added the missing one, the old one `labspuppetbackend::mysql_password` is still there and `hieradata/... [09:54:23] (03PS1) 10Elukey: Add the eventlogging admins back to eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/326085 (https://phabricator.wikimedia.org/T152621) [09:59:16] (03PS1) 10Marostegui: mariadb: Added gtid_domain_id to its own variable [puppet] - 10https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) [09:59:33] (03CR) 10Elukey: [C: 032] Add the eventlogging admins back to eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/326085 (https://phabricator.wikimedia.org/T152621) (owner: 10Elukey) [10:01:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:01:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:02:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [10:02:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:02:52] (03CR) 10Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/4848/" [puppet] - 10https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [10:03:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:03:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:03:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:04:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [10:04:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4721978 keys, up 39 days 1 hours - replication_delay is 0 [10:05:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:05:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:08:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:08:52] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326090 [10:09:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:09:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:10:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:10:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [10:10:07] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326090 (owner: 10Marostegui) [10:10:44] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326090 (owner: 10Marostegui) [10:11:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:11:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [10:11:46] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2855084 (10elukey) Last action left: removing unnecessary hiera data belonging to the eventlogging role (that is r... [10:12:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 - T150644 (duration: 00m 46s) [10:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [10:13:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:15:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [10:15:53] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:18:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:19:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:19:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:20:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [10:22:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [10:25:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:26:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [10:52:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:53:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [10:55:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [10:56:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [10:59:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [10:59:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:59:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:04] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:53] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:01:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:02:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:03:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:04:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:04:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:05:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:05:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644) [11:07:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:07:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:07:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [11:08:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:08:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:08:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326101 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [11:09:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:09:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:09:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T150644 (duration: 00m 46s) [11:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:30] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [11:10:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:10:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:12:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:12:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:13:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:13:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:13:56] !log Deploy alter table s5 wikidatawiki.revision on db1087 - T150644 [11:14:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:14:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:14:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:03] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:18:03] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:18:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:20:03] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:20:03] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:20:03] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:20:03] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:20:03] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:03] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:03] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:04] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:04] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:05] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:06] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:20:53] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [11:20:54] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:21:03] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [11:21:03] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:21:04] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:21:53] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:21:53] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [11:22:03] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [11:22:03] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:23:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:23:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:23:03] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:23:03] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:23:14] !log upgrading cache_upload to varnish 4.1.4-1wm1 [11:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:03] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [11:24:03] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:26:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:03] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:03] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:03] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:03] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:03] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:03] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:03] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:03] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [11:29:03] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [11:29:03] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:03] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:03] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:29:03] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:03] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:04] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:53] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:29:53] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:29:54] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [11:30:03] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:30:03] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [11:31:03] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [11:31:03] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:31:03] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:31:03] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:31:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:31:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:32:03] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [11:32:03] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [11:33:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:33:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:33:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:33:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy 
[11:33:03] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:33:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:34:03] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [11:34:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:34:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:34:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:36:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:37:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:37:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:38:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:40:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:41:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:41:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:41:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:43:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:43:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:46:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:46:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:48:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:57] !log mobrovac@tin Starting deploy [changeprop/deploy@9a33bf4]: (no message) [11:49:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:49:03] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:48] !log mobrovac@tin Finished deploy [changeprop/deploy@9a33bf4]: (no message) (duration: 00m 51s) [11:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:50:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:50:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:50:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:50:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:50:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:52:53] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:36] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859571 (10jcrespo) @kaldari I do not see long-running script being referenced on https://wikitech.wikimedia.org/wiki/Deployments#... [11:59:10] 06Operations: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2859574 (10Volans) [12:00:57] !log scb stopping changeprop in eqiad to investigate outage [12:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:41] !log https://grafana-admin.wikimedia.org/dashboard/db/api-requests Made the template variable for MediaWiki.api.main.executeTiming. 
to be refreshed on dashboard load (that is for the pXX entries) [12:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:35] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2859600 (10Addshore) As the messages now seem to be appearing this can be rolled out to mw.org on monday. [12:03:43] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:03:53] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:04:13] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:04:13] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:04:13] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:04:13] PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:04:33] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:04:33] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:08:00] ACKNOWLEDGEMENT - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage [12:08:00] ACKNOWLEDGEMENT - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac CP stopped to investigate MW API outage [12:08:01] ACKNOWLEDGEMENT - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac CP stopped to investigate MW API outage [12:08:09] bblack: https://phabricator.wikimedia.org/T142944#2782663 Would this suffice (for a first trial at least)? 
[12:16:39] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1189.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2']) [12:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:27] 06Operations, 10ops-eqiad: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2859620 (10Volans) [12:21:33] PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time [12:22:33] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 74312 bytes in 0.125 second response time [12:22:53] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:23:07] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1189.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2']) [12:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:27:33] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74313 bytes in 0.138 second response time [12:27:53] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.031 second response time [12:27:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [12:31:53] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:44:13] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [12:44:23] !log scb re-enabled changeprop [12:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:33] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [12:44:33] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [12:44:43] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [12:44:54] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [12:45:13] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [12:45:13] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [12:45:13] RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy [12:50:25] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1289.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2']) [12:50:29] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1290.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=api_appserver', 'service=apache2']) [12:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326109 [12:58:08] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326109 (owner: 10Marostegui) [12:58:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326109 (owner: 10Marostegui) [12:59:46] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1087 - T150644 (duration: 00m 45s) [12:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:58] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [13:00:53] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:02:01] (03PS1) 10Yuvipanda: labs: More fixups to labsdbaccounts db [puppet] - 10https://gerrit.wikimedia.org/r/326111 [13:20:34] (03PS2) 10Yuvipanda: labs: More fixups to labsdbaccounts db [puppet] - 10https://gerrit.wikimedia.org/r/326111 [13:20:42] (03CR) 10Yuvipanda: [V: 032 C: 032] labs: More fixups to labsdbaccounts db [puppet] - 10https://gerrit.wikimedia.org/r/326111 (owner: 10Yuvipanda) [13:27:34] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:32] (03PS1) 10Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - 10https://gerrit.wikimedia.org/r/326113 [13:29:49] 06Operations, 10MediaWiki-API, 10Monitoring, 10Parsoid, and 2 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859704 (10hashar) [13:30:53] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[13:31:13] (03PS1) 10Volans: Raid handler: force check_nrpe over IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774) [13:31:46] marostegui: is that you ^^^ (puppetmaster1001) [13:31:48] (03PS2) 10Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - 10https://gerrit.wikimedia.org/r/326113 [13:32:07] volans: don't think so [13:32:14] volans: let me double check [13:32:32] the missing puppet merge [13:32:52] (03PS3) 10Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - 10https://gerrit.wikimedia.org/r/326113 [13:33:09] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [13:33:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [13:33:11] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [13:33:12] volans: nope, my change has not been submitted: https://gerrit.wikimedia.org/r/#/c/326086/ [13:33:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [13:33:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [13:33:14] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [13:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:53] !log akosiaris@puppetmaster1001 
conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [13:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:56] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [13:33:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:50] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [13:34:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [13:34:54] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[13:35:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [13:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:05] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [13:35:07] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [13:35:08] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [13:35:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [13:35:11] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [13:35:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [13:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:18] !log depool fully scb1003, scb1004 T150882 [13:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:38] (03PS4) 10Yuvipanda: labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - 10https://gerrit.wikimedia.org/r/326113 [13:35:46] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Differentiate between legacy grants and newer labsdbs [puppet] - 10https://gerrit.wikimedia.org/r/326113 (owner: 10Yuvipanda) [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:41] T150882: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882 [13:39:13] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [13:39:13] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:41:06] !log poweroff scb1003, scb1004 [13:41:13] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859725 (10akosiaris) Depooled and shutdown scb1003, scb1004. Scheduled downtime in icinga as well [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [13:41:43] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:42:17] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859726 (10akosiaris) @Cmjohnson The servers are ready for their thermal paste treatment. 
[13:42:31] uh oh [13:42:33] I'll look [13:43:28] (PS2) Volans: Raid handler: force check_nrpe over IPv4 [puppet] - https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774) [13:45:50] (CR) Volans: [C: 2] Raid handler: force check_nrpe over IPv4 [puppet] - https://gerrit.wikimedia.org/r/326115 (https://phabricator.wikimedia.org/T152774) (owner: Volans) [13:47:13] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - create-dbusers is active [13:47:13] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [13:47:27] !log disable puppet on db1047, db1046 and dbstore1002 in preparation for restarts T152188 [13:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:39] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [13:47:58] YuviPanda: I /thnk/ it's died once before for a similar ldap restarted in the middle of the long ldap query via the mem leak cron [13:48:18] as-is iirc it connects to both ldap servers and round robins the queries to get all users and handles a disconnect poorly [13:48:37] chasemp: nope, this is just me. my last patch broke it [13:48:44] ah :) [13:48:46] chasemp: new patch coming up [13:48:53] (PS1) Yuvipanda: labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118 [13:49:03] chasemp: we should probably not be running it in two places tho [13:49:41] well that's a question I imagine, of whether it will be idempotent and it's not a big deal or whether it will race condition against itself etc [13:50:48] chasemp: idempotent against *two* simultaneous processes is going to be hard - you've to count every single line of code you write as if there could be a race somewhere else. ANd idk if it buys us anything at all? 
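(Editorial sketch, not part of the channel log.) The concern raised just above, that a job can be idempotent when run once at a time yet still race when two copies run simultaneously, could be illustrated roughly as follows. This is a minimal sketch under stated assumptions and is not the create-dbusers code: `ensure_account`, `run_two_workers`, the in-memory `accounts` dict, and the lock are hypothetical names introduced only for illustration.

```python
import threading

# Hypothetical in-memory stand-in for an account store (assumption, not the real service).
accounts = {}
creations = []  # records each time an account is actually created
store_lock = threading.Lock()


def ensure_account(name):
    """Create `name` if it is missing; calling it again is a no-op."""
    # Without the lock, two workers could both pass the membership check
    # before either one writes, and both would "create" the account.
    with store_lock:
        if name not in accounts:
            accounts[name] = "credential-for-" + name
            creations.append(name)


def run_two_workers(name):
    """Simulate the same job running in two places at once."""
    workers = [threading.Thread(target=ensure_account, args=(name,)) for _ in range(2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

An in-process lock like this only serializes workers inside one process. For the same job running on two separate hosts, the serialization would have to come from something both hosts share, or from running the job on a single host only.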
[13:51:10] chasemp: idempotent != concurrent, I ugess [13:51:14] I think the current implementation it's god the possiblity of bad news but I would like to solve it in a way that's not $active_host switches in puppet if possible [13:51:14] *guess [13:51:22] s/god/got heh [13:51:49] chasemp: what do you have in mind? [13:52:51] right now the mechanism for who is active is controlled via the nfs-manage script and ideally that's the canonical interface, so possibly a marker that derives from that which could be the cluster IP that is only avail on the active or a more explicit drop file [13:53:01] the idea being that when you down or up w/ that it's authoritative [13:53:05] chasemp: that sounds good to me [13:53:12] chasemp: can you write that up on the ticket? [13:53:13] esp considering time constraints on NFS failure being graceful [13:53:15] yessum [13:53:32] I've never seen the nfs-manage script, so I might need some help there. but that's going to be a while anyway [13:53:38] YuviPanda: that main goal one? [13:54:02] sure it's not at all complex, it's basically a procedural on what to bring up for the full stack [13:54:05] chasemp: I mean, not today :D I will probably get the script done today, but not deploy it [13:54:21] cool [13:54:22] chasemp: there's already a 'canonical store' of all the user accts on m5 :D just user creation left now [13:54:52] I haven't figured out if I want this to run on a cron or as a daemon tho [13:54:53] awesome, how did farming the existing creds work out? [13:55:08] I went down that road adn came to the conclusion we stink at monitoring crons [13:55:29] chasemp: took me a while to sort out the edge cases. 
everything is 'stable' now [13:55:33] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:55:40] that's why I've grudgingly accepted the nfs-exportd etc as a 'timer' [13:56:00] yeah, that's why I've been writing them to be in a looping process [13:56:12] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2828786 (Vriullop) I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile... [13:56:21] (PS2) Yuvipanda: labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118 [13:56:29] (CR) Yuvipanda: [V: 2 C: 2] labsdb: Fixup errors in previous commit [puppet] - https://gerrit.wikimedia.org/r/326118 (owner: Yuvipanda) [13:57:13] !log upgrading cache_text to varnish 4.1.4-1wm1 [13:57:16] YuviPanda: cool man, good looking on the possibility of race condition and general issue there, I'll drop a note on the task today [13:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:43] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 [13:59:45] (PS1) Volans: Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774) [14:00:29] Hi ops-team, the eventLogging error is me stopping this consumer for a DB restart [14:00:31] (PS2) Volans: Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774) [14:00:40] I'll try to acknowledge in icinga [14:01:15] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: 
consumer/mysql-m4-master-00 Jcrespo stopped for mysql restart [14:01:48] Thanks jynus [14:01:51] (CR) Volans: [C: 2] Raid handler: parse arguments before setup logging [puppet] - https://gerrit.wikimedia.org/r/326119 (https://phabricator.wikimedia.org/T152774) (owner: Volans) [14:03:53] !log setting db1046, db1047, dbstore1002 in read-only mode/stopping replication [14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:16] !log restarting db1046 T152188 [14:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [14:08:07] Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2859757 (mobrovac) [14:08:43] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [14:09:33] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - create-dbusers is active [14:11:22] (PS1) Jcrespo: analytics-mariadb: Enable new certificates on eventlogging servers [puppet] - https://gerrit.wikimedia.org/r/326122 (https://phabricator.wikimedia.org/T152188) [14:13:16] (CR) Jcrespo: [C: 2] analytics-mariadb: Enable new certificates on eventlogging servers [puppet] - https://gerrit.wikimedia.org/r/326122 (https://phabricator.wikimedia.org/T152188) (owner: Jcrespo) [14:14:12] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2859773 (mobrovac) >>! In T151702#2859748, @Vriullop wrote: > I try to expand it from the wiki part. A page with this template transclusion was rendered in... 
[14:20:22] (PS1) Jcrespo: eventlogging-mariadb: Add new TLS certs to eventlogging severs [puppet] - https://gerrit.wikimedia.org/r/326123 (https://phabricator.wikimedia.org/T152188) [14:20:53] (CR) Jcrespo: [C: 2] eventlogging-mariadb: Add new TLS certs to eventlogging severs [puppet] - https://gerrit.wikimedia.org/r/326123 (https://phabricator.wikimedia.org/T152188) (owner: Jcrespo) [14:27:21] (PS1) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [14:28:39] (PS2) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [14:32:13] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:32:43] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. 
[14:33:40] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4849/" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [14:35:13] (PS3) Marostegui: check_mariadb.pl: Fixed small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [14:35:50] (CR) Marostegui: "The new script can be tested at dbstore1001:/home/marostegui" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [14:39:23] !log restarting db1047 T152188 [14:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:36] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [14:39:44] (PS1) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 [14:40:27] (PS2) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 [14:40:32] Operations, MediaWiki-API, Monitoring, Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859816 (Anomie) [14:41:22] (CR) jenkins-bot: [V: -1] labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 (owner: Yuvipanda) [14:42:22] (PS3) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 [14:42:45] (PS4) Yuvipanda: labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 [14:44:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:46:19] (CR) Jcrespo: "On a maintenance window, will look at it later, this need careful review because otherwise we could leak private data." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [14:46:34] (CR) Yuvipanda: [C: 2] labsdbs: add new labsdbs to config [puppet] - https://gerrit.wikimedia.org/r/326125 (owner: Yuvipanda) [14:48:21] (PS4) Marostegui: check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [14:48:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:48:52] (CR) Marostegui: "Sounds good!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [14:49:39] !log restarting dbstore1002 T152188 [14:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [14:50:13] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180047.91 seconds [14:51:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:52:13] ^marostegui, dbstore1001 is that you? 
[14:52:24] no, I haven't done anything [14:52:36] ah, I know what it is [14:52:39] I think I ack'ed in the morning the replication break one but maybe not the lag one [14:52:46] the lag alert is very large there [14:52:59] so it is a 24-hour delayed alert [14:53:01] :-) [14:53:28] large here means, the lag offset is allowed to go very far [14:54:44] will handle that in a second [14:54:52] once I fix analytics [14:57:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:57:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:58:14] Operations, MediaWiki-extensions-UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2859855 (mehtab.ahmed) @Aklapper : waiting for reply. [15:00:53] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:13] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:04:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:04:48] Operations: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2859863 (Volans) Related to T142430 [15:12:12] Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859881 (GWicke) [15:13:35] (PS1) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 [15:14:22] (CR) jenkins-bot: [V: -1] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema) [15:15:45] Operations, 
ArticlePlaceholder, Traffic, Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2859905 (Lydia_Pintscher) Hey :) We'd really like to move forward with making the ArticlePlaceholder more useful. It not showing... [15:16:12] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2859907 (Deskana) [15:16:15] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2859906 (Deskana) Open>Resolved [15:16:18] Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859908 (mobrovac) [15:16:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:16:51] (PS2) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 [15:17:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:17:53] (CR) jenkins-bot: [V: -1] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema) [15:20:03] (PS3) Ema: varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 [15:21:37] (PS1) Hoo man: Load the property order from Wikidata per default [mediawiki-config] - https://gerrit.wikimedia.org/r/326133 (https://phabricator.wikimedia.org/T149540) [15:21:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:24:06] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag 
Replication lag: 182027.99 seconds Jcrespo https://phabricator.wikimedia.org/T152766 [15:25:26] Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859881 (EBernhardson) Poking through the 'Visualize' tab, kibana 4 reports having both standard and date based histogr... [15:25:38] !log reedy@terbium$ time mwscript refreshImageMetadata.php --wiki=testwiki --force --mediatype=BITMAP | tee /tmp/refreshtestwikiimages.log [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:32:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:32:49] Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2859969 (GWicke) The workflow we were using is more along the lines of: 1) Query some subset of log entries, typically... [15:34:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [15:38:43] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:33] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [15:39:34] (CR) Jcrespo: ".*$ ?" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [15:40:53] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857.00 seconds [15:40:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 692.07 seconds [15:41:13] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1062.75 seconds [15:43:53] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 132.99 seconds [15:43:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 207.08 seconds [15:44:36] the downtime expired [15:44:45] just in time to get solved [15:46:03] db1047 should be solved in a minute [15:47:13] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:48:31] (CR) Ema: [C: 2] varnish: add varnish hitrate dstat plugin [puppet] - https://gerrit.wikimedia.org/r/326132 (owner: Ema) [15:48:58] (PS5) Marostegui: check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [15:49:27] (CR) Marostegui: check_mariadb.pl: Fix small display issue (1 comment) [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [15:55:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:56:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:56:53] PROBLEM - puppet last run on labvirt1004 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:57:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:57:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:59:13] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 1.81 seconds [16:00:18] (CR) Jcrespo: [C: 1] check_mariadb.pl: Fix small display issue [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [16:00:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:02:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:03:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:05:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:05:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:05:43] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:07:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:16:13] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:21:32] !log powering off scb1003 for thermal paste replacement [16:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:53] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:24:28] Operations, MediaWiki-API, Parsoid, RESTBase, and 5 others: Support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860017 (mark) I just found an HHVM issue which suggests that max_execution_time is ignored in FCGI mode: https://github.com/facebook/hhvm/is... [16:25:53] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:33:22] (CR) Jcrespo: [C: 1] "My recommendation would be to reploy on Monday. Last thing I want is to deploy a friday afternoon a new alerting logic and create a page s" [puppet] - https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: Marostegui) [16:35:40] ^ jynus yes, no way I am deploying that now :) [16:36:00] (PS1) Mark Bergsma: Set hhvm.server.request_timeout_seconds to 60s [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) [16:42:46] Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2860051 (Cmjohnson) Both servers have had their thermal paste removed and replaced. [16:46:15] Operations, Prometheus-metrics-monitoring: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#2804471 (faidon) I don't think we should mess with the system's PAM config for this -- that's going to be a dangerous change, especially in the long run. [16:47:01] Operations, ops-eqiad, Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2860054 (Cmjohnson) [16:51:53] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:03:13] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:07:13] PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed [17:07:23] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:33] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:03] (CR) GWicke: [C: 1] Set hhvm.server.request_timeout_seconds to 60s [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma) [17:24:23] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [17:25:13] RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active [17:25:45] (CR) Anomie: "Checking api.log, I see a fair number of requests that would be affected by this if it goes by wall time. Half them seem to be action=pars" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma) [17:26:03] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:27:53] (CR) Mark Bergsma: "> Checking api.log, I see a fair number of requests that would be" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma) [17:32:13] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:37:23] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:41:33] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:53:43] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:55:03] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:59:13] PROBLEM - carbon-cache@b service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [17:59:23] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:00:53] (CR) Volans: "In general LGTM, but I have zero knowledge of gdnsd stats ;)" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: Filippo Giunchedi) [18:01:58] (CR) GWicke: [C: 1] "I really think we need to prioritize system stability and availability over allowing very expensive requests to consume server side resour" [puppet] - https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: Mark Bergsma) [18:03:59] (PS1) Volans: Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) [18:04:45] (PS2) Volans: Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) [18:07:33] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:18] (CR) Filippo Giunchedi: [C: 1] Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans) [18:11:12] Operations, MediaWiki-API, Parsoid, RESTBase, and 6 others: Support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860191 (Anomie) Tried my code from T97192#1237258 with `hhvm.server.request_timeout_seconds`. Even though changing it in `/etc/hhvm/fcgi.ini... [18:11:13] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:14:14] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:18:53] (CR) Volans: "Puppet compiler results: https://puppet-compiler.wmflabs.org/4851/" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans) [18:19:47] (CR) Volans: [C: 2] Revert "Add python-confluent-kafka to eventlogging::dependencies" [puppet] - https://gerrit.wikimedia.org/r/326148 (https://phabricator.wikimedia.org/T142430) (owner: Volans) [18:21:43] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:22:39] Operations, Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2860212 (fgiunchedi) [18:22:43] (PS3) Anomie: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - https://gerrit.wikimedia.org/r/313035 [18:22:54] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:23:37] yay, finally stat1002 works [18:24:13] RECOVERY - carbon-cache@b service on graphite1003 is OK: OK - carbon-cache@b is active [18:24:23] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [18:26:15] Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2860228 (Volans) given @Ottomata was out today we decided to revert the last change that added that package to unblock Puppet on `stat1002`. On all the hosts in which it wa... 
[18:30:05] Operations, MediaWiki-API, Parsoid, RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2860252 (GWicke) [18:34:55] (Draft1) Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) [18:34:57] (Draft2) Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) [18:35:33] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:36:06] ostriches ^^, that is a prep patch in case we decide to do that but that will fix one problem, we now just need to pull stable-2.13 and cherry-pick those patches that fixes external_id [18:36:07] https://gerrit-review.googlesource.com/#/c/92830/ [18:36:32] can I have someone run fixNamespaceDupes to check a li'l thing? [18:38:30] running scripts with 'mwscript' will execute them with PHP, and not HHVM, correct? [18:38:50] bd808: you probably know ^ [18:39:47] paladox: That would go after the fix + reindex. Then we'll try the patch [18:39:55] Yep [18:40:06] I put a note in there that it needs your +1 [18:42:23] ostriches are you pulling from upstream (stable-2.13)? [18:42:29] Yep [18:42:36] I'm building stable-2.13 right now [18:42:41] The changes got merged [18:42:42] Ok [18:42:43] thanks [18:42:59] are you going to cherry-pick https://gerrit-review.googlesource.com/#/c/92830/ [18:43:00] ? [18:43:23] I don't *think* it'll be necessary. 
[18:43:29] We could I suppose [18:43:35] We already got these 5: [18:43:37] 082f939324 Fix eviction order when linking new external ids [18:43:38] 8f83bbbb27 AccountManager#create: Do not overwrite external ID of other account [18:43:38] ca547ff308 Add REST endpoint to reindex a single account [18:43:38] 5862be6274 Revert "AccountManager: Check that ext ID belongs to account before delete" [18:43:38] 352be569f9 AccountManager: Check that ext ID belongs to account before delete [18:43:46] (PS3) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949 [18:44:01] Oh [18:46:14] I think the other fixes make 92830 unnecessary [18:46:44] oh [18:46:52] That being said, it couldn't hurt. [18:47:04] Yep [18:47:30] rebuilding, shouldn't take long [18:47:36] Yep :) [18:47:39] (PS1) EBernhardson: Don't retry InitImageDataJob's [puppet] - https://gerrit.wikimedia.org/r/326151 [18:47:55] Please tell me why we have to rebuild Documentation/licenses every time, even when it doesn't change :p [18:47:57] So slowwwww [18:48:14] Oh well [18:48:45] Yeh [18:48:46] so slow [18:49:06] and slower when approach the end [18:49:45] LOL, yeh reindex may be quicker this time or not [18:49:51] It should be [18:49:56] I think I can *just* reindex accounts [18:50:01] Have to check [18:50:01] yep [18:50:03] (PS1) Rush: WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 [18:50:06] Then yeah, should be fast [18:50:11] It's changes that are slow, not accounts [18:50:31] Yep [18:50:44] (CR) jenkins-bot: [V: -1] WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 (owner: Rush) [18:52:12] Operations, MediaWiki-extensions-UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2860328 (Aklapper) Feel free to propose in a separate task. 
:) [18:52:13] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:53:22] (PS2) Rush: WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 [18:54:11] bd808: ok, i now see that mwscript uses php5 and not hhvm. why? D: [18:54:21] (CR) jenkins-bot: [V: -1] WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153 (owner: Rush) [18:54:56] i see. 8f8e7dbdd834066504e59edfc4881bb98f76072a D: [18:55:12] why can nothing ever just work :( [18:55:34] (PS1) Chad: gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 [18:55:54] (PS1) Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - https://gerrit.wikimedia.org/r/326155 [18:56:32] MatmaRex: mwscript *could* use hhvm, but it would be slower [18:56:42] (CR) Chad: "Dunno if that's the right way to format "minor" bullet points. Google says yes I'm not sure if "minor" is the right way to do a list as a " [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (owner: Chad) [18:56:47] bd808: context is https://phabricator.wikimedia.org/T32961#2860324 [18:56:52] how much slower is an "it depends" question [18:56:53] (CR) Paladox: "Probaly want to add Bug: T152640" [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (owner: Chad) [18:56:55] (CR) jenkins-bot: [V: -1] [WIP] Create framework to transfer files over the LAN [puppet] - https://gerrit.wikimedia.org/r/326155 (owner: Jcrespo) [18:57:01] bd808: php is buggy for my use case, and hhvm probably isn't [18:57:32] bd808: isn't hhvm supposed to be better for long-runinng maintenance scripts? :/ [18:57:42] MatmaRex: care to guess how many times it would got the other direction? 
;) [18:58:19] we could probably add a flag to mwscript that made it possible to use hhvm [18:58:21] bd808: oh, i'm sure plenty, just a couple weeks ago i was porting fixes from php to hhvm [18:58:22] bd808: would you run namespaceDupes.php maintenance script for me for https://phabricator.wikimedia.org/T152793 ? [18:58:32] but just today, php happened to be worse [18:58:34] or you can just as easily craft the correct command to do so yourself [18:58:39] (and only because we're running an old version) [18:59:37] MarcoA: I could probably do that. cawikiquote? [18:59:46] bd808: yep [18:59:56] bd808: hmm, if you're here, wanna run a maintenance script on testwiki for me afterwards? ;) [19:00:19] mwscript namespaceDupes.php --cawikiquote --fix ? [19:00:29] not sure about the syntax [19:00:51] should be on terbium [19:01:46] --wiki=cawikiquote [19:01:52] !log Ran `mwscript namespaceDupes.php cawikiquote --fix` for T152793 [19:01:52] or just no --wiki= [19:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:05] T152793: Fix namespaces in ca.wiktionary - https://phabricator.wikimedia.org/T152793 [19:02:42] MarcoA: can you check to see if that fix you problem? [19:02:46] sure [19:02:58] MatmaRex: what's you honeydo? [19:02:59] (PS2) Filippo Giunchedi: gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (https://phabricator.wikimedia.org/T152640) (owner: Chad) [19:03:08] (CR) Filippo Giunchedi: [V: 2 C: 2] gerrit (2.13.3-wmf.2) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/326154 (https://phabricator.wikimedia.org/T152640) (owner: Chad) [19:03:25] bd808: doubleredirects now list that page as stricked, so it should be fixed [19:03:29] thanks a bunch [19:03:32] yw [19:04:35] bd808: ok, nevermind, i'm getting Reedy to do it. thanks :) [19:04:53] Because buying beers in GBP is cheaper than USD? 
;) [19:04:58] haha [19:06:53] I think we've got a problem [19:07:12] if you type now VD:AJUDA, it redirects you to https://ca.wikiquote.org/wiki/Especial:GoToInterwiki/vd:AJUDA [19:07:28] hmm [19:08:23] bd808: Didn't we make HHVM CLI not use JIT stuff? [19:09:09] (PS2) RobH: adding new shell users arnad & jgonsior [puppet] - https://gerrit.wikimedia.org/r/325868 (https://phabricator.wikimedia.org/T152023) [19:09:17] !log powering down db1073 to apply thermal paste https://phabricator.wikimedia.org/T149728 [19:09:28] (CR) RobH: [C: 2] adding new shell users arnad & jgonsior [puppet] - https://gerrit.wikimedia.org/r/325868 (https://phabricator.wikimedia.org/T152023) (owner: RobH) [19:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:32] Reedy: bd808: fwiw, mwscript was switched to use php5 in this commit 10 months ago: https://gerrit.wikimedia.org/r/#/c/267816/ [19:09:35] Reedy: ... don't remember. We did something to make it not pre-fork an exec thread pool [19:10:07] yeah... I think that bug should be fixed now [19:10:15] that was the pre-fork pool thing [19:12:14] Operations, Ops-Access-Requests, Research-and-Data, Patch-For-Review: adrian bielefeldt & julius gonsior shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T152023#2860391 (RobH) Open>Resolved No objections were noted, so this has been merged live. It will ta... [19:12:35] MarcoA: so did namespaceDupes.php mess something up? [19:12:51] bd808: not sure what might have happened [19:13:06] Dereckson: you there? [19:13:45] MarcoA: I'm going to step away for food, but poke Reedy if you find something that needs help while I'm gone. [19:13:54] MarcoA: yup [19:13:59] How can I help you? [19:14:01] bd808: thanks [19:14:20] Dereckson: we've just run namespaceDupes at cawikiquote and it seems something went wrong [19:14:38] * Dereckson checks [19:14:43] you've the output on a pastebin? 
[19:15:09] it's on Phab [19:15:44] T152793 [19:15:45] T152793: Fix namespaces in ca.wiktionary - https://phabricator.wikimedia.org/T152793 [19:15:49] (PS2) RobH: new shell user piccardi [puppet] - https://gerrit.wikimedia.org/r/325869 (https://phabricator.wikimedia.org/T151969) [19:16:21] !log upload gerrit_2.13.3+git1-wmf.1 to carbon - T152640 [19:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:35] T152640: Cannot log into Gerrit as of recent upgrade - https://phabricator.wikimedia.org/T152640 [19:16:38] ostriches: ^ it is taking forever to clone gerrit locally, I'll send the patch later [19:16:41] it seems there's no NamespaceAliases defined but that shouldn't be an issue - it gave Title error before running the script [19:16:46] (CR) RobH: [C: 2] new shell user piccardi [puppet] - https://gerrit.wikimedia.org/r/325869 (https://phabricator.wikimedia.org/T151969) (owner: RobH) [19:17:11] godog: Okie dokie, I'm seeing the new package after an update on cobalt [19:17:32] vd: is an interwiki defined in the interwiki map :S [19:17:50] MarcoA: okay, I'm looking [19:18:09] Operations, Ops-Access-Requests, Research-and-Data, Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969#2860421 (RobH) Open>Resolved No objections were noted, so this has been merged live. It will take up to 30 minutes... [19:18:50] I'd suggest we remove the namespace [19:18:52] !log gerrit: bringing offline for just a minute or two for bug fix upgrade, T152640 [19:18:52] alias [19:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:11] and we run again namespacesDupes, it will move the page to main [19:19:15] Dereckson: which one? there's no alias [19:19:20] VD: [19:19:34] you mean, remove it from the interwiki map? 
[19:19:48] no no, I thought VD: was also an alias for a ca.wikiquote namespace [19:19:54] nope [19:20:12] VD: should just use the main namespace [19:20:15] ok [19:20:25] but I think the interwiki map is messing there [19:20:39] my suggestion is to remove vd: from the interwiki map [19:21:00] let's try through the api to rename this page [19:21:13] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:22:30] !log gerrit: back up, including new fixes. Users will have to re-login, sorry :) [19:22:33] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [19:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:54] ostriches i now have the problems [19:22:57] with signing in [19:22:58] lol [19:22:59] Cannot assign user name "paladox" to account 4335; name already in use. [19:23:03] .... [19:23:03] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki] [19:23:06] i used paladox which worked up until now [19:23:21] ostriches i carn't login now [19:23:33] Using Paladox dosent work either [19:23:36] that sound i just heard was chad's sanity snapping. [19:23:59] Interesting, mine's working case-insensitive now too [19:24:06] MarcoA: bd808: VD-PMF alraedy exists, that's why it didn't renamed it [19:24:07] .... [19:24:10] I can't with you gerrit.... [19:24:17] Oh [19:24:25] Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860428 (kaldari) @jcrespo: Thanks for reminding me to list long-running script runs on the calendar. I had completely forgotten... 
[19:24:43] mine works, case insensitive or with proper case. [19:24:51] Dereckson: maybe they noticed that VD:xxx didn't worked and then used slashes instead of colons [19:25:10] Im wondering if this problem is fix for users that actual usernames are uppercase [19:25:21] but breaks for users who's actual username is lowercase [19:25:26] "code": "missingtitle", [19:25:30] "info": "The page you requested doesn't exist", [19:25:33] fun [19:25:38] "Please follow along the disccusion at #wikimedia-operations on freenode as we debug." [19:26:08] "code": "invalidtitle", [19:26:09] "info": "Bad title \"VD:PMF\"", [19:26:19] Reedy likly ostriches may be reindexing [19:26:24] as suggested by upstream [19:26:28] I think Gerrit broke ostriches [19:26:30] So, not through API [19:26:49] oh yes [19:26:55] let's do it through API, but using the pageid [19:26:58] paladox: I'm reindexing you as a test [19:27:02] ok [19:27:04] thanks :) [19:27:08] MarcoA: you didn't note the page id by the way? [19:27:23] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [19:27:33] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [19:27:40] Dereckson: id=6879 ns=0 dbk=VD:PMF -> VD-PMF (no conflict) [19:27:43] paladox: Ok, try your account now [19:27:47] Ok [19:27:54] MarcoA: that's the one move, I suspect there is another one [19:27:54] That worked [19:28:03] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:28:04] let me try case [19:28:05] YAY [19:28:08] insensitive [19:28:10] Case, less important for now [19:28:15] BUT YAY THE FIX WORKS [19:28:23] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [19:28:25] ostriches yay [19:28:27] it works [19:28:30] ostriches: remote: ERROR: committer email address reedy@wikimedia.org [19:28:34] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.3-13-ge39e5211b7 (SSHD-CORE-1.2.0) (protocol 2.0) [19:28:36] it's gone awol? [19:28:42] Reedy try logging in on gerrit [19:28:43] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [19:28:44] DON'T GIVE ME NEW PROBLEMS [19:28:47] GO AWAY [19:28:49] lol [19:29:24] Reedy is that the only error it gave [19:29:42] along with many other lines [19:29:45] Oh [19:29:48] Mines completely broken [19:29:49] Reedy: No, you just don't have that e-mail registered... [19:29:50] Cannot assign user name "reedy" to account 4340; name already in use. [19:29:53] ostriches: I did previous [19:29:55] ly [19:30:02] I'm saying you don't now :p [19:30:04] I can't login at all to gerrit [19:30:15] Yeah your account is among the busted. [19:30:33] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [19:30:37] Ok, now I just gotta find all the busted accounts and fix them [19:30:40] It shouldn't break again [19:30:46] (PS2) Andrew Bogott: Add labs root key for bd808 [labs/private] - https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: BryanDavis) [19:30:54] (CR) Andrew Bogott: [V: 2 C: 2] Add labs root key for bd808 [labs/private] - https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: BryanDavis) [19:31:18] ostriches we should create a test user in wikitech with uppercase username and try it in gerrit. [19:31:25] I'm not worried about the casing. [19:31:28] That doesn't matter yet. [19:31:28] just to see weather anything will affect new users too. [19:31:33] Ok [19:31:36] We need to fix the broken accounts first [19:31:49] yep [19:31:54] i wonder how do we do that? [19:32:00] as we could have a ton [19:32:06] Just gotta find the rows that are busted, shouldn't be hard. [19:32:10] oh [19:32:37] Operations, ops-eqiad, DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860443 (Cmjohnson) The thermal paste on cpu1 was nearly non-existent. Cleaned both CPU's and re-applied paste. After booting the server, the disk in slot 2 failed. A ticket has been created w... [19:32:56] Dereckson: sorry, I don't have the page ID then. I still think VD: conflicts with interwiki map vd: - I'll remove it from there and we can see if that resolves the issue. [19:33:30] let me try something before [19:36:09] sorry, haven't read you before - it's removed there but until dumpInterwiki isn't run and merged it won't take effect [19:38:29] Ok, it's 37 total busted users. 
[19:38:46] Oh [19:39:45] MarcoA: bd808: so, there is no VD:PMF, and the only PMF we have is Viquidites:PMF, see https://ca.wikiquote.org/w/index.php?title=Viquidites:PMF&action=info [19:40:08] The special double redirect page has still old data: The following data is cached, and was last updated 07:35, 7 December 2016. A maximum of 5,000 results are available in the cache. [19:43:26] paladox: https://phabricator.wikimedia.org/P4603 [19:43:41] Oh :) :) [19:43:57] Basically, insert all the rows like they should look. IGNORE any that are already there [19:44:17] Yep [19:44:24] thats a great sql query :) [19:46:15] Lemme test it on my test data [19:46:35] Ok :) [19:47:27] Ok....now, here's the final test. [19:47:34] Ok :) [19:47:43] PROBLEM - carbon-cache@d service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed [19:47:49] Works on the test data. [19:47:55] :) :) :) [19:47:56] (PS3) Andrew Bogott: openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk) [19:48:08] !gerrit: Down one last time to fix the busted accounts [19:48:14] !log gerrit: Down one last time to fix the busted accounts [19:48:23] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:53] Ok, crossing fingers.... [19:50:02] metoo [19:50:11] * greg-g throws salt over his left shoulder [19:50:20] Ok, I logged in ok.... [19:50:26] James_F: Please tell me you can login again [19:50:27] * ostriches prays [19:50:33] Err. [19:50:59] Yes. [19:51:01] :) [19:51:03] Yay! Thank you ostriches. [19:51:20] YAYAYAYAYAYAYAY [19:51:33] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [19:51:33] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [19:52:03] Reedy could you try logging in please? You should be able to use reedy and Reedy now again :) [19:52:44] Also reedy@wm.o should work again ;-) [19:52:52] (PS1) Filippo Giunchedi: debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) [19:53:02] Oh :) [19:53:16] Operations, Labs, Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#2860488 (ArielGlenn) We are all agreed that jenkins failing on this would be pretty annoying, since it could block unrelated changes. BUT surely we can do some sort of regular audit. [19:53:49] ostriches project access links are now fixed [19:54:00] as you pulled from upstream 2.13 branch it included the fix :) [19:54:23] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [19:54:43] RECOVERY - carbon-cache@d service on graphite1003 is OK: OK - carbon-cache@d is active [19:55:43] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:55:51] kaldari: Are you back in now? [19:55:54] Plz say yes <3 [19:56:03] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:56:04] kaldari: yep [19:56:12] ostriches: thanks! [19:56:25] YAYAYAYAY <3 [19:56:54] Well, that was a fun way to spend my Wed/Thurs/Fri! [19:56:58] LOL [19:57:00] Let's do that again real soon. [19:57:01] Not. 
[19:57:04] LOL [19:57:19] look at it this way, you'll have the weekend free :-P :-P [19:57:24] ostriches onto the next task https://phabricator.wikimedia.org/T152663 [19:57:25] LOL [19:57:33] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:57:40] paladox: Don't remind me! [19:57:47] After lunch, I'm taking a much needed break afk. [19:57:47] oh [19:57:51] ok [19:57:51] hehe [20:02:17] https://groups.google.com/forum/#!topic/repo-discuss/KzLJiNqu2AM [20:02:19] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2860511 (Andrew) Open>Resolved [20:02:22] ostriches i created ^^ [20:02:38] so that just in case we need to do anything that is not documented on the docs [20:03:01] I'm sure we can figure it out.... [20:03:08] I don't wanna keep annoying upstream :p [20:03:25] Ok [20:04:06] (CR) Paladox: [C: 1] debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) (owner: Filippo Giunchedi) [20:11:26] (Draft1) Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - https://gerrit.wikimedia.org/r/326163 [20:11:29] (Draft2) Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - https://gerrit.wikimedia.org/r/326163 [20:13:26] (PS1) Papaul: DNS: Add mgmt DNS for ms-fe200[5-8] Bug:T152612 [dns] - https://gerrit.wikimedia.org/r/326165 (https://phabricator.wikimedia.org/T152612) [20:13:48] (CR) jenkins-bot: [V: -1] DNS: Add mgmt DNS for ms-fe200[5-8] Bug:T152612 [dns] - https://gerrit.wikimedia.org/r/326165 (https://phabricator.wikimedia.org/T152612) (owner: Papaul) [20:19:20] ostriches i accidently published a comment on that page about submodules using my real name, luckly they have a delete button, so i deleted the comment and published it under my second google account. 
[20:19:33] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:19:33] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:19:54] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk) [20:20:03] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:21:23] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:23:46] !log mwscript updateCollation.php --wiki=bnwiki --force [20:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:33] I had to report https://bugs.chromium.org/p/gerrit/issues/detail?id=5116 [20:28:38] (CR) Andrew Bogott: [C: 2] openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk) [20:32:00] paladox: Yeah I saw that [20:32:07] Oh :) [20:32:20] I also reported https://bugs.chromium.org/p/gerrit/issues/detail?id=5111 earlier [20:32:49] oh [20:39:05] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:53:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [20:53:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [20:53:21] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [20:53:23] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [20:53:23] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [20:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:33] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [20:53:34] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [20:53:35] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [20:53:36] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [20:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [20:54:01] !log akosiaris@puppetmaster1001 conftool action 
: set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [20:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [20:54:05] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [20:54:11] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [20:54:14] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [20:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:17] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [20:54:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [20:54:23] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [20:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:36] Operations, ops-eqiad, Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2860598 (akosiaris) I 've fully repooled the servers, let's wait a couple of days and see. [20:57:21] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [20:57:22] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [20:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:43] !log fully repool scb1003, scb1004, T150882 [20:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:54] T150882: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882 [20:58:24] (PS1) Andrew Bogott: Remove http monitoring for spice-proxy [puppet] - https://gerrit.wikimedia.org/r/326173 [20:59:58] (CR) Andrew Bogott: [C: 2] Remove http monitoring for spice-proxy [puppet] - https://gerrit.wikimedia.org/r/326173 (owner: Andrew Bogott) [21:03:02] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:08:07] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:24:46] Operations, MediaWiki-API, Monitoring, Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2859704 (Legoktm) I don't think this is going to be that useful....the numbers will be all over the place, we really just need to look at th... [21:25:17] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [21:30:59] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:31:39] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:45] !log hotpatched python maintain-meta_p.py --all-databases --debug on labsdb1001/1003 [21:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:11] (Draft1) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 [21:39:14] (Draft2) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 [21:40:01] (PS3) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 [21:44:01] (PS4) Paladox: Gerrit: Enable logstash in gerrit [puppet] - https://gerrit.wikimedia.org/r/326177 [21:44:29] robh: hey, poking you since you're on duty :) but not high priority: do you know when we're planning to upgrdae to HHVM 3.12.11? it would fix metadata extraction for certain files. 
https://phabricator.wikimedia.org/T148606 [21:45:53] i have no idea, but i imagine _joe_ would (its late his timezone in the eu though so i'll ask him on monday) [21:46:29] moritz is out on leave so it may not be as fast as normal [21:46:32] but ill ask [21:52:19] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:53:09] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89236.08 seconds [21:59:39] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:00:17] (PS4) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949 [22:01:08] ok, thanks [22:33:18] (CR) Aaron Schulz: [C: 1] Don't retry InitImageDataJob's [puppet] - https://gerrit.wikimedia.org/r/326151 (owner: EBernhardson) [22:34:56] (CR) Filippo Giunchedi: [C: 2] debian/changelog: bump upstream version [debs/gerrit] - https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) (owner: Filippo Giunchedi) [22:40:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [22:44:19] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:47:19] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:53:19] Operations, ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2860862 (fgiunchedi) Resolved>Open @Cmjohnson looks like we're seeing this again on ms-be1016 :( ``` root@ms-be1016:~# hpssacli controller all show Smart Array P840 in Slot 1 (sn... [22:58:19] !log catrope@tin Synchronized php-1.29.0-wmf.5/resources/src/mediawiki.language/mediawiki.language.numbers.js: T152800 (duration: 00m 45s) [22:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:34] T152800: Notices and Alerts on Wikipedia Arabic cannot be opened (TypeError: transformTable is undefined) - https://phabricator.wikimedia.org/T152800 [23:07:27] Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860886 (jcrespo) @kaldari, this doesn't have to be synchronous. Please schedule a time with some advance notice on the Deployme... [23:07:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:11:43] Operations, ops-eqiad, DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860890 (jcrespo) Thank you very much. I will reset the RAID when the new disk gets installed (if I can handle the bios interface). A new disk failing would explain the previous RAID I/O error. [23:12:15] (CR) BryanDavis: [C: -1] "Why would we send gerrit logs into the deployment-prep logstash instance? There is no reason for those logs to leave the production realm." 
[puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox) [23:13:09] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:13:55] (CR) Paladox: "Hi, we wouldn't. I am testing it on a labs instance, so this patch will change to the prod logstash once I get it working on labs." [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox) [23:16:05] (CR) BryanDavis: [C: -1] "You should set up your own logstash cluster in the project where you are testing this. The deployment-prep ELK stack isn't a general servic" [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox) [23:16:34] (CR) Paladox: "Ok." [puppet] - https://gerrit.wikimedia.org/r/326177 (owner: Paladox) [23:30:09] (PS3) Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) [23:30:57] Operations, Mail: Create email alias for benefactors@ - https://phabricator.wikimedia.org/T152641#2855852 (CaitVirtue) Hi All -- Can you share an ETA on this? We're triaging a high volume of email to benefactors@ right now via gmail, which is really cumbersome. Getting these messages into ZenDesk is goi... [23:31:04] (CR) Filippo Giunchedi: "@volans thanks for the review! I've addressed your comments." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: Filippo Giunchedi) [23:31:21] Operations, Mail: Create email alias for benefactors@ - https://phabricator.wikimedia.org/T152641#2860929 (CaitVirtue) p:Triage>Unbreak! 
[23:40:38] (PS1) Chad: No need to import ValueError, it's built in [mediawiki-config] - https://gerrit.wikimedia.org/r/326199 [23:42:09] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:42:17] Operations, ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2860965 (fgiunchedi) p:Triage>High [23:46:55] ostriches, I figured out how to get these submodules to work [23:46:56] yay [23:46:57] yay [23:47:25] just one more test to confirm [23:48:40] yeah [23:48:43] I figured it out [23:48:56] submitting a patch now to give it a test on prod [23:49:27] it will require us to set it on all mediawiki/extensions/*, otherwise the ones that have it set will work [23:49:31] and the others won't [23:51:29] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:52:10] Can we remove it from all of them so it inherits? [23:52:48] Yeah, I am trying that one [23:54:57] ostriches, I can't believe upstream did not mention you have to add subscribe. [23:54:59] yay [23:55:08] it works without having to set each and every config [23:55:11] submitting now [23:55:49] I guess we gotta update all repos... [23:55:57] Will have to do it in batch... [23:56:34] no [23:56:37] we don't [23:56:46] ostriches, I managed to set it once in All-Projects [23:56:51] and it worked [23:59:31] ostriches https://gerrit.wikimedia.org/r/#/c/326200/ [23:59:32] :)