[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T0000). Please do the needful.
[00:00:05] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:27] I'll swat
[00:01:08] (03PS5) 10Filippo Giunchedi: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles)
[00:02:39] \o
[00:02:49] MaxSem: i guess i'm the only one :)
[00:03:06] MaxSem: both patches are no-ops for prod (well, the second one removes a feature on testwiki ...)
[00:03:17] (03CR) 10MaxSem: [C: 032] [cirrus] Rename CirrusSearchMoreLikeThisCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313037 (owner: 10EBernhardson)
[00:03:44] it's probably worth at least pulling the second one to mw1099 to make sure nothing odd happens though
[00:03:55] (03Merged) 10jenkins-bot: [cirrus] Rename CirrusSearchMoreLikeThisCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313037 (owner: 10EBernhardson)
[00:04:20] MaxSem: doh, somehow the commit has an event-schemas submodule update in it ... one sec i'll remove that
[00:05:23] (03PS3) 10EBernhardson: [cirrus] Remove deprecated per-user poolcounter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313211
[00:05:33] fixed
[00:05:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[00:06:22] !log maxsem@tin Synchronized wmf-config/CirrusSearch-labs.php: [labs only] https://gerrit.wikimedia.org/r/#/c/313037/ (duration: 00m 48s)
[00:06:28] ebernhardson, ^
[00:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:04] MaxSem: beta search still appears to work
[00:07:27] ebernhardson, boo-ring
[00:08:02] (03CR) 10MaxSem: [C: 032] [cirrus] Remove deprecated per-user poolcounter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313211 (owner: 10EBernhardson)
[00:08:45] (03Merged) 10jenkins-bot: [cirrus] Remove deprecated per-user poolcounter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313211 (owner: 10EBernhardson)
[00:09:32] !log maxsem@tin Synchronized wmf-config/CirrusSearch-labs.php: [labs only] https://gerrit.wikimedia.org/r/#/c/313037/ (duration: 00m 47s)
[00:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[00:11:14] ebernhardson, pulled on mw1099
[00:12:40] MaxSem: seems happy
[00:13:12] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2784442 (10Tgr) Related: {T74328}, {T67383}, {T75935}
[00:14:41] !log maxsem@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/313211/ (duration: 00m 52s)
[00:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:48] ebernhardson, ^
[00:18:03] MaxSem: all looks happy. Thanks!
[00:18:14] :)
[00:19:31] (03PS1) 10BBlack: tlsproxy: support multiple independent staples [puppet] - 10https://gerrit.wikimedia.org/r/320704 (https://phabricator.wikimedia.org/T93927)
[00:23:20] (03CR) 10BBlack: [C: 032] tlsproxy: support multiple independent staples [puppet] - 10https://gerrit.wikimedia.org/r/320704 (https://phabricator.wikimedia.org/T93927) (owner: 10BBlack)
[00:33:37] (03PS1) 10BBlack: tlsproxy: ensure OCSP done before nginx reload [puppet] - 10https://gerrit.wikimedia.org/r/320705
[00:34:51] (03CR) 10BBlack: [C: 032] tlsproxy: ensure OCSP done before nginx reload [puppet] - 10https://gerrit.wikimedia.org/r/320705 (owner: 10BBlack)
[00:35:10] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2784519 (10fgiunchedi)
[00:37:32] !log remove files on iridium:/tmp older than 5d - T150396
[00:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:39] T150396: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396
[00:39:13] (03CR) 10BBlack: [C: 032] remove stapling_proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320113 (https://phabricator.wikimedia.org/T93927) (owner: 10BBlack)
[00:39:56] (03CR) 10BBlack: [C: 032] remove readahead patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320114 (https://phabricator.wikimedia.org/T148917) (owner: 10BBlack)
[00:40:02] (03CR) 10BBlack: [C: 032] remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161 (owner: 10BBlack)
[00:40:05] (03CR) 10BBlack: [C: 032] depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162 (owner: 10BBlack)
[00:40:07] (03CR) 10BBlack: [C: 032] new variant for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320629 (https://phabricator.wikimedia.org/T144523) (owner: 10BBlack)
[00:40:11] (03CR) 10BBlack: [C: 032] add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 (https://phabricator.wikimedia.org/T93927) (owner: 10BBlack)
[00:40:17] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2784519 (10Dzahn) 3% more space freed with "apt-get clean"
[00:40:22] (03CR) 10BBlack: [C: 032] nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 (owner: 10BBlack)
[00:41:18] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2784547 (10fgiunchedi) Also `/var/log/account` is 800MB, which wouldn't necessarily be an issue though `/` is very small (10G)
[00:41:52] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2784548 (10Dzahn) yep, as on einsteinium, process accounting is most of the size of /var/log/, and yes, it should just be extended to larger than 10G
[00:42:23] mutante: heh, needs a reimage :(
[00:45:04] godog: imho we should: a) finish phab2001 (blocked on networking/DNS/ferm setup) b) switch production to contint2001 to prove it is ready and can be failed over to c) while that is the case, reinstall iridium, get a larger /, rename it to phab1001 d) switch back to phab1001 or keep that as the warm standby
[00:45:39] tldr: it is supposed to be renamed anyways
[00:46:14] also https://gerrit.wikimedia.org/r/#/c/317290/ etc
[00:46:15] !log nginx-1.11.4-1+wmf14 uploaded to carbon jessie-wikimedia (only deployed to cp1008 for now) - T93927 - T148917 - T144523
[00:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:25] T144523: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523
[00:46:25] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917
[00:46:26] T93927: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927
[00:46:33] which needs https://gerrit.wikimedia.org/r/#/c/318662/
[00:48:40] 06Operations, 10Phabricator: iridium / filesystem almost full - https://phabricator.wikimedia.org/T150396#2784575 (10Dzahn) We should do this while reinstalling iridium as phab1001. Ideally we finish phab2001, fail over to that, then reinstall this one with a larger / and the new name.
[00:54:44] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2784580 (10RobH)
[00:54:46] 06Operations, 10ops-codfw, 10EventBus, 10netops: kafka2003 switch port configuration - https://phabricator.wikimedia.org/T150380#2784578 (10RobH) 05Open>03Resolved switch config updated
[01:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T0100).
[01:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:09:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:11:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[01:27:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[01:28:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3006529 keys, up 9 days 17 hours - replication_delay is 0
[01:40:11] (03PS1) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706
[01:42:25] 06Operations, 10hardware-requests: Analytics AQS cluster expansion - https://phabricator.wikimedia.org/T149920#2768949 (10RobH) 05Open>03stalled a:03RobH I've requested quotes via the sub-task. Setting this assigned to me and stalled while the #procurement task progresses.
[01:44:35] (03PS2) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092)
[01:46:23] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784615 (10Andrew) Update -- documentation for keystone/oauth is pretty poor, and keystone has a habit of leaving partially-completed features to rot so I...
[02:05:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[02:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:15:59] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:19] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 07m 23s)
[02:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:19] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:33:26] (03PS1) 10Papaul: Add mgmt DNS entries for kafka2003 and 3 spare pool servers Bug: T150340 T150341 [dns] - 10https://gerrit.wikimedia.org/r/320711
[02:33:48] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 05m 52s)
[02:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:19] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:38:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 10 02:38:24 UTC 2016 (duration 4m 36s)
[02:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[02:39:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:43:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[02:44:22] (03PS1) 10Papaul: Add prod DNS entried for kafka2003 Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320713 (https://phabricator.wikimedia.org/T150340)
[02:44:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:44:59] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[02:49:09] (03CR) 10Alex Monk: Keystone: Limit password auth to certain hosts and users. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott)
[02:49:19] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[02:58:56] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2784657 (10Papaul)
[03:00:42] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784668 (10AlexMonk-WMF) >>! In T150092#2784615, @Andrew wrote: > It compiles properly for labtestcontrol but not for labcontrol, which I'm not yet clear...
[03:02:19] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:07:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[03:09:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[03:10:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:10:54] (03PS3) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092)
[03:15:51] (03PS1) 10Papaul: Add DHCP entries for kafka2003 Bug:T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320716 (https://phabricator.wikimedia.org/T150340)
[03:17:13] (03PS4) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092)
[03:18:17] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2784730 (10Papaul)
[03:19:45] 06Operations, 10ops-codfw: rack spare pool servers and update tracking sheet - https://phabricator.wikimedia.org/T150341#2784734 (10Papaul)
[03:20:08] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784739 (10Andrew) > Maybe it needs a require network::constants before the template() or something? Yep, fixed by changing the erb lookup.
[03:22:52] 06Operations, 10ops-codfw, 10netops: Spare pool servers switch configuration - https://phabricator.wikimedia.org/T150400#2784748 (10Papaul)
[03:27:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.22 seconds
[03:31:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.21 seconds
[03:50:19] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
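The iridium cleanup logged at [00:37:32] above ("remove files on iridium:/tmp older than 5d") was not recorded as an exact command; a minimal sketch of that kind of age-based cleanup, demonstrated against a scratch directory rather than the real /tmp, might look like this (the GNU `touch -d` backdating is only there to fabricate an "old" file for the demo):

```shell
# Sketch of an "older than 5 days" /tmp cleanup (run here against a
# throwaway directory; the real command on iridium was not logged).
scratch=$(mktemp -d)
touch "$scratch/fresh.log"
touch -d '6 days ago' "$scratch/stale.log"   # GNU touch: backdate mtime for the demo

# -mtime +5 matches files whose mtime is more than 5 full days old
find "$scratch" -type f -mtime +5 -print -delete

ls "$scratch"        # only fresh.log should remain
rm -rf "$scratch"
```

`-print -delete` in that order gives a record of what was removed, which is handy when the action is being `!log`ged.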
[03:52:09] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[04:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:11:18] (03PS2) 10Mattflaschen: Add Flow test namespace to all labs wikis that have Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320647 (owner: 10Catrope)
[04:16:41] (03CR) 10Mattflaschen: [C: 032] Add Flow test namespace to all labs wikis that have Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320647 (owner: 10Catrope)
[04:17:04] (03PS2) 10Mattflaschen: Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 (owner: 10Catrope)
[04:17:21] (03Merged) 10jenkins-bot: Add Flow test namespace to all labs wikis that have Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320647 (owner: 10Catrope)
[04:17:49] (03CR) 10Mattflaschen: [C: 032] Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 (owner: 10Catrope)
[04:17:54] (03PS3) 10Mattflaschen: Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 (owner: 10Catrope)
[04:20:59] (03CR) 10Mattflaschen: [C: 032] Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 (owner: 10Catrope)
[04:21:33] (03Merged) 10jenkins-bot: Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 (owner: 10Catrope)
[04:21:49] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:26:57] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Beta Cluster only (duration: 00m 59s)
[04:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:21] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta Cluster only (duration: 00m 51s)
[04:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:49] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[04:51:39] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:07:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[05:09:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[05:19:39] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[05:29:17] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2784837 (10aaron)
[06:05:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:08:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[06:26:39] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:39:09] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:54:39] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:59:37] <_joe_> so it's been three days that the mediawiki exceptions and fatals alarm has been going off repeatedly
[06:59:52] <_joe_> I guess no one looked into it
[07:00:30] <_joe_> and it's mostly ploticus
[07:07:09] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:19:26] (03PS1) 10Marostegui: db-eqiad.php: Repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320728 (https://phabricator.wikimedia.org/T149079)
[07:47:00] (03PS1) 10Marostegui: mariadb: Deploy gtid_domain_id to all the misc shards [puppet] - 10https://gerrit.wikimedia.org/r/320731 (https://phabricator.wikimedia.org/T149418)
[08:01:05] !log Deploy schema change s4 commonswiki.revision (dbstore1002) - T147305
[08:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:14] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[08:06:16] !log rebooting bast3001 for kernel update
[08:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[08:08:30] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 10 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2784925 (10Arrbee)
[08:30:38] !log rebooting copper for kernel update
[08:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:19] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:55:58] !log rebooting ruthenium for kernel update
[08:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:06:39] !log rolling restart of elasticsearch on logstash100[4-6] for picking up a Java security update
[09:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:21:20] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[09:22:14] !log elastic@codfw: reindexing commonswiki (logs in wasat.codfw.wmnet:~dcausse/commons_reindex/cirrus_log)
[09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:26] (03PS2) 10Marostegui: db-eqiad.php: Repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320728 (https://phabricator.wikimedia.org/T149079)
[09:27:00] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320728 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui)
[09:27:44] (03CR) 10Jcrespo: [C: 031] mariadb: Deploy gtid_domain_id to all the misc shards [puppet] - 10https://gerrit.wikimedia.org/r/320731 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[09:27:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320728 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui)
[09:28:29] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320728 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui)
[09:30:45] (03PS2) 10Jcrespo: Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633
[09:31:16] (03PS3) 10Jcrespo: Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633
[09:31:48] (03CR) 10Marostegui: [C: 031] Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633 (owner: 10Jcrespo)
[09:32:57] 06Operations, 10Continuous-Integration-Infrastructure: Ubuntu Trusty mirror has Packages Hash Sum mismatch errors - https://phabricator.wikimedia.org/T150406#2785045 (10hashar)
[09:35:20] (03CR) 10Marostegui: [C: 032] Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633 (owner: 10Jcrespo)
[09:36:03] (03Merged) 10jenkins-bot: Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633 (owner: 10Jcrespo)
[09:37:49] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2785057 (10Gehel) The problem is related to our old version of logstash-gelf (1.5.3). This version initializes th...
[09:38:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: wmf-config/db-codfw.php Depool db1059 - T149079. Repool db2048 T150334 (duration: 00m 50s)
[09:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:35] T150334: db2042 reimage - https://phabricator.wikimedia.org/T150334
[09:38:35] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079
[09:38:59] (03PS10) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129)
[09:39:22] 06Operations, 10Continuous-Integration-Infrastructure: Ubuntu Trusty mirror has Packages Hash Sum mismatch errors - https://phabricator.wikimedia.org/T150406#2785065 (10hashar) Looks like the checksum in the `InRelease` files does not match / the file is more recent.
[09:39:31] 06Operations, 10Continuous-Integration-Infrastructure: Ubuntu Trusty mirror has Packages Hash Sum mismatch errors - https://phabricator.wikimedia.org/T150406#2785066 (10hashar) p:05Triage>03Low
[09:41:04] (03CR) 10Marostegui: [C: 032] mariadb: Deploy gtid_domain_id to all the misc shards [puppet] - 10https://gerrit.wikimedia.org/r/320731 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[09:41:27] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (10Gehel)
[09:42:59] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:43:31] (03PS1) 10Ema: cache_text: upgrade eqiad to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320745 (https://phabricator.wikimedia.org/T131503)
[09:43:47] !log restarting druid daemons on druid100[123] for openjdk updates
[09:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:23] !log Deploy gtid_domain_id mysql flag for misc shards - https://phabricator.wikimedia.org/T149418
[09:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:17] (03PS2) 10Ema: cache_text: upgrade eqiad to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320745 (https://phabricator.wikimedia.org/T131503)
[09:52:22] (03CR) 10Ema: [C: 032 V: 032] cache_text: upgrade eqiad to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320745 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema)
[09:53:45] !log upgrading cp1053 (text-eqiad) to varnish 4 -- T131503
[09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:51] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503
[09:57:50] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2785156 (10Marostegui) The snapshots are taken so `dbstore2001` is ready to get the new disks. Later today. I have placed them at: `dbstore2002:/srv/tmp` there is one from 7th Nov and another one...
[10:02:04] 06Operations, 10ops-eqiad, 10hardware-requests, 10netops, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2785180 (10jcrespo) Mark changed the vlan already. I need @Cmjohnson to assign it an ip and...
[10:04:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:05:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[10:08:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:08:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:09:05] same thing happened yesterday
[10:09:06] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=now-3h&to=now&var-site=All&var-cache_type=All&var-status_type=5
[10:09:12] brief spike in 503s
[10:09:22] probably mw fatals related
[10:10:48] !log rolling restart of zookeeper in codfw to pick up java security update
[10:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:59] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[10:12:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:13:50] Checked 5xx on oxygen but I can see a lot of upload related ones, meanwhile it seems a text issue
[10:14:08] and only esams related
[10:14:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:15:20] ema: is it v4 migration related by any chance?
[10:15:29] elukey: I don't think so, no
[10:15:38] okok
[10:20:24] elukey: the upload errors seem to be mostly 500s BTW
[10:23:12] ema: yeah I am checking @oxygen:/srv/log/webrequest$ egrep '10:0[012]:' 5xx.json | grep -v upload | grep esams but it seems api related?
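The pipeline elukey quotes at [10:23:12] slices a three-minute window (10:00-10:02) out of the per-request 5xx log, drops cache_upload traffic, and keeps esams. A self-contained sketch of the same idea is below; the two sample records are invented stand-ins, since the real schema of 5xx.json on oxygen is not shown in the log:

```shell
# Reproduce the "egrep '10:0[012]:' 5xx.json | grep -v upload | grep esams"
# slicing against a fabricated two-record sample (fields are illustrative only).
cat > 5xx-sample.json <<'EOF'
{"dt":"2016-11-10T10:01:12","uri_host":"en.wikipedia.org","site":"esams","uri_path":"/w/api.php","http_status":"503"}
{"dt":"2016-11-10T10:01:30","uri_host":"upload.wikimedia.org","site":"esams","uri_path":"/thumb/x.jpg","http_status":"500"}
EOF

# Time-window match on "10:00:"/"10:01:"/"10:02:", then filter out upload,
# then keep only esams; only the api.php record should survive.
grep -E '10:0[012]:' 5xx-sample.json | grep -v upload | grep esams

rm -f 5xx-sample.json
```

Filtering on a raw timestamp substring is crude but fast, which is usually the point when a 503 spike is still in progress.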
[10:23:18] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[10:23:28] elukey: yep, POSTs to /w/api.php apparently
[10:23:36] <_joe_> is someone rebooting mw1280?
[10:23:41] not me
[10:23:50] <_joe_> apergos maybe?
[10:24:06] don't think so, the kernel reboots of mw are all complete
[10:24:29] <_joe_> looking, then
[10:25:30] !log rolling restart of zookeeper in eqiad to pick up java security update
[10:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:03] <_joe_> !log powercycling mw1280, unresponsive to ping, blank unresponsive console
[10:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:58] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[10:29:14] !log upgrading cp1054 (text-eqiad) to varnish 4 -- T131503
[10:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:21] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503
[10:32:28] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:33:28] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[10:34:48] (03PS1) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829)
[10:37:21] (03PS2) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829)
[10:38:22] (03PS3) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829)
[10:40:59] (03PS4) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829)
[10:43:30] !log upgrading cp1055 (text-eqiad) to varnish 4 -- T131503
[10:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:36] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503
[10:43:54] (03PS5) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829)
[10:48:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:50:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:51:51] (03PS3) 10Muehlenhoff: Restrict access to Hive server [puppet] - 10https://gerrit.wikimedia.org/r/320574
[10:52:29] perhaps the ~ 5/s 500s in upload might be responsible for these ^ ?
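The icinga alerts threaded through this log (e.g. "40.00% of data above the critical threshold [50.0]") report the fraction of recent graphite datapoints exceeding a threshold. The core arithmetic of such a check can be sketched in a few lines of awk over an invented window of datapoints (the values below are made up; the real check's window size and alert fraction live in its configuration):

```shell
# Count what fraction of a datapoint window exceeds a threshold,
# mimicking the "% of data above the critical threshold [N]" wording.
# The ten datapoints are fabricated for the demo; 5 of them exceed 50.
echo "12 55 80 30 91 14 22 60 75 18" | tr ' ' '\n' |
  awk -v thr=50 '{n++; if ($1 > thr) above++}
       END { printf "%.2f%% of data above the threshold [%s]\n", 100*above/n, thr }'
# prints: 50.00% of data above the threshold [50]
```

A real check would then compare that percentage against separate warning/critical fractions before deciding the alert state.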
[10:56:17] (03CR) 10Muehlenhoff: [C: 032] Restrict access to Hive server [puppet] - 10https://gerrit.wikimedia.org/r/320574 (owner: 10Muehlenhoff) [10:56:38] (03CR) 10Alexandros Kosiaris: [C: 031] tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [10:57:40] ema: we could check sumSeries(varnish.esams.*.frontend.request.client.status.5xx.sum) [10:58:43] seems text from https://graphite.wikimedia.org/S/Bv [10:59:20] so it is basically re-using varnish-aggregate-client-status-codes [11:00:12] (If I got it correctly) [11:04:17] elukey: yep! [11:05:24] I do see MediaWiki exceptions and fatals per minute in icinga as well [11:05:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:05:43] heh there you go [11:06:20] seems like the same issue that happened yesterday with logged in users getting 503s for a little while? [11:07:21] the check is pretty late anyways, I see no 503s at the moment in text esams [11:08:19] !log upgrading cp1065 (text-eqiad) to varnish 4 -- T131503 [11:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [11:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:11:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:11:41] ema: https://graphite.wikimedia.org/S/Bw - really weird [11:11:48] but why only in esams? [11:12:19] more requests at this time of the day? [11:12:49] could be yes, did we get the same alerts during EU night?
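The recurring icinga messages above ("11.11% of data above the critical threshold [1000.0]") come from a graphite-backed check over the 5xx series being discussed. A minimal sketch of the percentage-over-threshold logic, with an illustrative function name and sample values (not the real check code):

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null graphite datapoints strictly above a threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

# Nine one-minute samples of 5xx reqs/min; a single point over 1000 yields ~11.11%.
series = [210.0, 180.0, 95.0, 1240.0, 300.0, 150.0, 170.0, 220.0, 130.0]
print(round(percent_above(series, 1000.0), 2))  # → 11.11
```

One point in nine over the threshold is exactly the 11.11% figure the alerts keep reporting, which is why a single short spike is enough to flip the check to CRITICAL.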
[11:13:45] I grepped on my IRC logs and it seems only esams :( [11:14:03] maybe it could be a bug affecting EU sites mostly [11:14:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:14:21] 06Operations, 10netops: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2230372 (10faidon) This is almost certainly a Juniper bug. asw-a/b/c/d-codfw currently run JunOS 14.1X53-D27.3 and asw2-d-eqiad runs 14.1X53-D35.3, the currently JTAC-recommended. Since this issue has persisted... [11:16:05] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2785288 (10faidon) p:05Normal>03High [11:18:52] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2785297 (10faidon) asw2-d-eqiad has been confirmed to be affected by T133387 and enabling IGMP snooping on the QFXes breaks IPv6. Since this aff... [11:21:50] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2785300 (10Gehel) Re-reading this after some sleep, it seems that @EBernhardson nailed all the answers (as alway... [11:24:11] ema: from https://graphite.wikimedia.org/S/Bx it seems that the first spike happened at ~13ish UTC two days ago, when we had the network issue. But then at around 16/17 the spikes are starting [11:25:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:25:35] elukey: so that seems to confirm it's mediawiki exceptions. What's the best way to investigate those? hhvm.log? 
[11:25:58] !log upgrading cp1066 (text-eqiad) to varnish 4 -- T131503 [11:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:05] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [11:26:38] I am checking mw logs on fluorine [11:26:45] but not really getting a lot out of it :D [11:26:55] there are a couple of deployments aligning to our timings on the SAL [11:27:56] hashar: sorry to ping you again, but are you aware of the current mw exceptions? (or maybe are you aware of any tracking task) [11:28:20] (more specifically https://graphite.wikimedia.org/S/Bw) [11:28:25] elukey: which of the thousands of exceptions do you think of ? :) [11:28:33] ah nice [11:28:47] as usual I am blaming the job runners [11:28:49] how about the ones yesterday that were hhvm fatals exceptions with no data [11:29:12] whatever those are called, the ones that look like meta-exceptions :) [11:29:13] a slightly nicer view is on Grafana https://grafana.wikimedia.org/dashboard/db/production-logging [11:29:37] the top right graph "MediaWiki exceptions and fatal rates" should have the same graphite query as the icinga check [11:30:04] thresholds might or might not be in sync. But in theory whenever there is a red bar, you get a corresponding icinga notification [11:30:12] all of that comes from logstash really [11:30:21] ah niceeeeeee [11:30:57] looking at 3 or 4 days period [11:31:15] seems the first of those red bars came on Nov 8th 17:00 UTC [11:31:20] or 6pm CET [11:31:31] <_joe_> and they are periodic, more or less [11:31:54] yeah [11:32:05] bunch of logs at https://tools.wmflabs.org/sal/production?p=2 [11:32:18] new Kernels (unlikely to be a cause) [11:32:39] <_joe_> very unlikely [11:32:43] Varnish 4 on codfw [11:32:53] what about the deployments?
[11:32:54] unlikely as well [11:33:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:33:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:33:29] <_joe_> deployments resumed for the euro swat after that network window [11:33:41] seems the spike is on the hour sharp [11:33:46] 17:00 18:00 etc [11:35:40] then knowing that. Gotta dig in the exception logs in https://logstash.wikimedia.org/ [11:35:43] <_joe_> so this morning we're having peaks of 503s that are unrelated to those exceptions IMO [11:35:44] or fluorine exception.log [11:35:58] <_joe_> looking at https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=text&var-status_type=5 [11:36:03] could be yes [11:36:15] we were wondering what was the cause of the alarms [11:36:39] <_joe_> ema: can this be related to changes in varnish in eqiad? [11:37:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:37:32] <_joe_> actually no, esams goes through dallas [11:37:35] the alarms are at best a reason to go look at graphs, they're not very indicative usually [11:37:40] labswiki silver CAS update failed on user_touched for user ID 'XXX' (read from replica); the version of the user to be saved is older than the current version. [11:37:42] (03PS1) 10Muehlenhoff: Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 [11:37:53] <_joe_> bblack: I'm looking at the graphs :) [11:37:56] some hourly cron hits wikitech api.php [11:38:19] and the LdapAuthentication plugin fails to save settings for a specific user id [11:38:28] nice!
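The "spike is on the hour sharp" observation above, together with the hourly wikitech cron just identified, is the kind of pattern you can confirm by bucketing error timestamps by minute-of-hour: a heavy bucket at minute 0 points at an hourly job. A small illustrative sketch with synthetic timestamps (not real log data):

```python
from collections import Counter
from datetime import datetime, timezone

def minute_histogram(epoch_timestamps):
    """Count events per minute-of-hour; a heavy 0 bucket suggests an hourly cron."""
    return Counter(datetime.fromtimestamp(ts, tz=timezone.utc).minute
                   for ts in epoch_timestamps)

# Ten errors exactly on the hour plus three scattered ones.
base = int(datetime(2016, 11, 8, 17, 0, tzinfo=timezone.utc).timestamp())
events = [base + h * 3600 for h in range(10)] + [base + 125, base + 1900, base + 2710]
print(minute_histogram(events).most_common(1))  # → [(0, 10)]
```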
[11:38:38] <_joe_> hashar: ok, let's ignore that [11:38:50] <_joe_> we're seeing spikes of 503s out of text in esams [11:39:01] well let's save it somewhere, it should be fixed [11:39:09] _joe_: nope, unrelated to eqiad [11:39:35] hashar: thanks! Let's open a phab task and then keep going with the 503s [11:39:35] the CAS failure is https://phabricator.wikimedia.org/T150373 [11:39:55] super [11:40:55] (03CR) 10Mobrovac: [C: 04-1] Deploy EventStreams on scb and configure LVS service in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [11:40:58] it's not eqiad for sure [11:41:18] those spikes are in both ulsfo and esams, but not codfw and not eqiad [11:41:50] but, I think the pattern there is confusing because with v4 we're getting a different sort of error behavior than before [11:42:53] I think the actual errors are happening at the codfw<->mediawiki boundary. they're fatals where (from varnish's perspective) mw drops the connection. [11:43:16] in the past, these would've generated a 503 there in the codfw backend which then bubbles up as a normal error response to e.g. ulsfo and esams. [11:43:36] I think with the default do_stream=true, the errors are reported differently and we have some cleanup to do on how X-Cache makes them look too [11:43:45] <_joe_> bblack: so you think it's mediawiki errors? 
because that doesn't match what I see in mw errors very well [11:43:59] (because the frontend is involved, it's like a temporary open pipe from the user->mw) [11:44:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:44:16] _joe_: yes, we saw some of these yesterday [11:44:33] more or less good view https://logstash.wikimedia.org/goto/9fb8a6f2422fd08d71cfca478128498a [11:45:22] _joe_: the code "16777217" errors that have basically no detail, when exception handling doesn't really even work [11:45:53] !log upgrading cp1067 (text-eqiad) to varnish 4 -- T131503 [11:45:57] <_joe_> hashar: yeah but those are wikitech-related errors mostly, so they don't really count [11:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:59] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [11:46:36] I think it's entirely possible those kinds of errors under v4 are now defeating our retry-503-once strategy because they look so different. [11:47:00] but I donno, it's all very confusing [11:47:20] bblack: can it be related to the patch introduced by our varnish wm3? [11:47:30] ema: quite possibly! [11:47:48] maybe the extrachance code was useful in these cases [11:48:33] in any case, though, at worst it's making a lower-layer 503 more-apparent and less-retryable. it's not causing the initial issue. 
[11:49:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:50:25] the one with headers from a user yesterday, it was clearly a varnish-generated 503 (from a connection error at some layer), but the relevant headers said "x-cache: cp3033 pass" + "x-cache-status: bug" [11:50:46] whereas if error-handling always worked like we thought it did, it should've been an "int" response [11:50:46] bblack: yes I've seen one of those this morning too [11:50:59] the one I've seen was a POST to /w/api.php [11:51:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:52:06] my theory is this comes from the kind of 503 where the applayer dropped the request ungracefully (closed the connection with no response or some kind of unparseable garbage) [11:52:40] and that default-streaming in v4 on this pass or miss means the whole stack is exposed to the ungracefulness, not just the bottom-most varnish [11:53:50] (03PS2) 10Muehlenhoff: Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 [11:54:25] 06Operations, 10DBA: mysql boxes not in ganglia - https://phabricator.wikimedia.org/T87209#2785350 (10jcrespo) 05Open>03declined [11:54:30] but yeah, the change in wm3 could be exacerbating that [11:54:51] we could try a single-shot extrachance too [11:55:00] but it would be nice to have a real idea what's happening in these cases [11:55:02] they all seem to be POST requests [11:55:22] that was the case yesterday too (specifically, logged-in ones) [11:56:09] (03CR) 10Jdlrobson: "I think this can be abandoned?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295926 (https://phabricator.wikimedia.org/T138578) (owner: 10Dr0ptp4kt) [11:56:14] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/4580/ - LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/320758 (owner: 10Muehlenhoff) [11:56:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:57:05] they align in esams+ulsfo though (traffic-relative) [11:57:17] so the real cause is either in codfw or in MW [11:58:30] now that I think about the POST angle, our 503-retry probably shouldn't ever attempt a retry on POST since it's non-idempotent :) [11:58:43] there are a few HHVM rendering criticals in icinga for codfw hosts [11:59:01] mw2093, 2098, 2099, 2104, ... [11:59:06] yeah but we're not using those [11:59:08] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:59:11] codfw varnish -> eqiad MW [11:59:16] ah, right! [12:00:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:01:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:01:53] !log upgrading cp1068 (text-eqiad) to varnish 4 -- T131503 [12:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:00] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [12:03:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [12:04:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:05:31] varnish 4 migration completed :) [12:06:01] we still have to route traffic back as normal, but there are no v3 hosts left [12:07:05] (03PS1) 10BBlack: VCL: only retry 503 for idempotent methods [puppet] - 10https://gerrit.wikimedia.org/r/320760 [12:07:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the 
critical threshold [1000.0] [12:07:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [12:08:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:08:31] 06Operations, 10DBA, 13Patch-For-Review: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2785367 (10jcrespo) 05Open>03Resolved a:03jcrespo closing because the ongoing issues were fixed, and long-term fixes will be done on T124307 [12:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:10:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:11:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:19:11] ema: awesome work! [12:20:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:20:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:22:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:26:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:27:08] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:42:05] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2785460 (10elukey) Thanks for pinging me, I didn't notice the problem since (afaics from #wikimedia-operations) statsv on hafnium did... 
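The change just filed ("VCL: only retry 503 for idempotent methods") encodes the point made a few minutes earlier: retrying a failed POST is unsafe because POST is not idempotent. A Python stand-in for that condition (the real change is VCL, not Python; the method list follows RFC 7231's idempotent methods):

```python
# Idempotent request methods per RFC 7231; safe to retry after a backend 503.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})

def should_retry_503(method, retries_so_far):
    """Retry a backend 503 at most once, and only for idempotent methods."""
    return retries_so_far == 0 and method.upper() in IDEMPOTENT_METHODS

print(should_retry_503("GET", 0))   # → True
print(should_retry_503("POST", 0))  # → False
```

This also matches the earlier observation that the problematic requests were all POSTs: under a retry-once-anything policy they would have been replayed, which is exactly what the patch rules out.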
[13:00:15] (03PS7) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [13:03:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:04:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:07:18] !log restart wdqs2* for jvm update [13:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:52] 06Operations, 10netops: Low IPv6 bandwth from Free.fr (AS12322) > Zayo > eqiad - https://phabricator.wikimedia.org/T150374#2785512 (10faidon) 05Open>03declined Since there is only one intermediate ASN between Proxad and us (Zayo) and the path within Zayo and within our network is basically exactly the same... [13:20:42] !log restart wdqs1* for jvm update [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] (03PS1) 10Faidon Liambotis: mirrors: update to a newer Ubuntu mirror script [puppet] - 10https://gerrit.wikimedia.org/r/320772 [13:21:48] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:22:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320773 (https://phabricator.wikimedia.org/T149079) [13:23:27] (03PS2) 10Faidon Liambotis: mirrors: update to a newer Ubuntu mirror script [puppet] - 10https://gerrit.wikimedia.org/r/320772 [13:23:45] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: update to a newer Ubuntu mirror script [puppet] - 10https://gerrit.wikimedia.org/r/320772 (owner: 10Faidon Liambotis) [13:24:55] (03PS8) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [13:25:13] (03CR) 10Elukey: [C: 032 V: 032] First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) (owner: 10Elukey) [13:25:30] !log Restarting mysql in misc shard slaves (only codfw - db2010,db2012,db2030) to apply a MySQL config - T149418 [13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:38] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418 [13:26:07] 06Operations, 10Continuous-Integration-Infrastructure: Ubuntu Trusty mirror has Packages Hash Sum mismatch errors - https://phabricator.wikimedia.org/T150406#2785538 (10faidon) 05Open>03declined We use Ubuntu's recommended mirroring method (a dual-pass rsync) to mirror from an official Ubuntu source, rsync... [13:26:31] marostegui, try to upgrade those servers at the same time [13:26:43] jynus: 10.0.28? [13:26:54] more interested on kernel updates [13:27:00] ah, sure :) [13:27:11] but mariadb is only relevant on jessie systems [13:27:24] also check tls and gtid [13:27:34] ok - will do! 
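The "also check tls and gtid" request above boils down to inspecting a couple of server variables on each restarted replica. A hypothetical sketch of that check against a `SHOW GLOBAL VARIABLES` result (the variable names `have_ssl` and `gtid_domain_id` are real MariaDB ones; the sample values are made up):

```python
def replica_config_ok(variables):
    """True if a MariaDB replica reports SSL available and a non-zero gtid_domain_id."""
    has_ssl = variables.get("have_ssl") == "YES"
    has_gtid_domain = variables.get("gtid_domain_id", "0") not in ("", "0")
    return has_ssl and has_gtid_domain

print(replica_config_ok({"have_ssl": "YES", "gtid_domain_id": "2010"}))    # → True
print(replica_config_ok({"have_ssl": "DISABLED", "gtid_domain_id": "0"}))  # → False
```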
[13:27:37] it takes very little if you are going to restart them [13:27:42] reimage takes more time [13:28:06] what do you want me to check in regards to tls? [13:28:07] that way we save time in the long run [13:28:16] that it is enabled with the latest certs [13:28:39] puppet rather than the ones initially deployed (pm for details) [13:29:35] !log cache_maps: upgrade nginx to 1.11.4-1+wmf14 [13:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:00] marostegui, I did https://gerrit.wikimedia.org/r/#/c/319831/1/modules/role/manifests/mariadb.pp [13:30:16] but may not be enough for all servers, needs checking, etc. [13:30:25] Ah I see [13:30:46] I think misc.cnf is only for s1, s2 and s5 [13:30:54] s3 is phabricator [13:31:08] and s4 is eventlogging, that has their own special config [13:31:29] Yeah I am checking and the one that does not have GTID is the one without SSL [13:31:34] s/^s/m/ [13:32:00] if it doesn't take much time away, commit the new config [13:32:05] (03PS1) 10Ema: cache_text: route codfw back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320774 (https://phabricator.wikimedia.org/T131503) [13:34:03] I do not want to put you a lot of burden, so feel free to ignore that, but it may save some time in the long run [13:34:36] could somebody hop onto an imagescaler and run `convert --version` and tell me the output, or otherwise find out what version of imagemagick we're running in production?
[13:35:38] (03CR) 10Ema: [C: 031] VCL: only retry 503 for idempotent methods [puppet] - 10https://gerrit.wikimedia.org/r/320760 (owner: 10BBlack) [13:36:14] (03PS1) 10BBlack: cache_maps: separate ECDSA/RSA stapling [puppet] - 10https://gerrit.wikimedia.org/r/320775 [13:36:16] (03PS1) 10BBlack: cache_misc: separate ECDSA/RSA stapling [puppet] - 10https://gerrit.wikimedia.org/r/320776 [13:36:19] https://www.irccloud.com/pastebin/bP8m0gj6/ [13:36:31] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: separate ECDSA/RSA stapling [puppet] - 10https://gerrit.wikimedia.org/r/320775 (owner: 10BBlack) [13:36:46] gehel: thank you [13:36:54] MatmaRex: np [13:37:34] (03CR) 10BBlack: [C: 031] cache_text: route codfw back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320774 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:38:13] (03PS2) 10Ema: cache_text: route codfw back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320774 (https://phabricator.wikimedia.org/T131503) [13:38:19] (03CR) 10Ema: [C: 032 V: 032] cache_text: route codfw back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320774 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:39:14] 06Operations, 10Continuous-Integration-Infrastructure: Ubuntu Trusty mirror has Packages Hash Sum mismatch errors - https://phabricator.wikimedia.org/T150406#2785549 (10hashar) The blog link is a fascinating read for the history geek I am. Thank you I heave learned a few things about the repositories layout.... 
[13:42:27] (03PS1) 10Ema: cache_text: route esams back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320778 (https://phabricator.wikimedia.org/T131503) [13:43:08] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2785558 (10Gehel) the following metrics can be deleted: `kartotherian.req.s*` [13:45:25] !log cache_misc: upgrade nginx to 1.11.4-1+wmf14 [13:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:09] (03PS2) 10BBlack: cache_misc: separate ECDSA/RSA stapling [puppet] - 10https://gerrit.wikimedia.org/r/320776 [13:47:13] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: separate ECDSA/RSA stapling [puppet] - 10https://gerrit.wikimedia.org/r/320776 (owner: 10BBlack) [13:49:11] jouncebot: next [13:49:11] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1400) [13:49:23] no changes [13:49:37] zeljkof: swat is empty today :] [13:49:46] hashar: excellent :) [13:50:48] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:51:28] !log Enabled gtid+ssl on db2010,db2012,db2030 [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:37] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320773 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [13:52:00] hashar: can I deploy then wmf-config/db-eqiad.php ? 
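The wmf-config/db-eqiad.php change waiting to go out depools db1068 for maintenance. Conceptually, depooling removes the replica from its section's load map so MediaWiki stops routing reads to it; a hypothetical sketch of that idea (db1068 is from the log, the other host names and all weights are invented):

```python
def depool(weights, host):
    """Return a copy of a section's replica load map without the given host."""
    return {h: w for h, w in weights.items() if h != host}

s4_loads = {"db1053": 50, "db1068": 100, "db1081": 200}  # illustrative only
print(sorted(depool(s4_loads, "db1068")))  # → ['db1053', 'db1081']
```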
[13:54:17] !log restarting varnishes on cache_maps + cache_misc [13:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:43] (03PS2) 10Ottomata: Add DHCP entries for kafka2003 Bug:T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320716 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [13:58:28] (03CR) 10Ottomata: [C: 032] Add DHCP entries for kafka2003 Bug:T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320716 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [13:59:31] (03CR) 10Ottomata: [C: 032] Add mgmt DNS entries for kafka2003 and 3 spare pool servers Bug: T150340 T150341 [dns] - 10https://gerrit.wikimedia.org/r/320711 (owner: 10Papaul) [13:59:43] (03CR) 10Ottomata: [C: 032] Add prod DNS entried for kafka2003 Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320713 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [13:59:45] (03PS2) 10Ottomata: Add prod DNS entried for kafka2003 Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320713 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [13:59:47] (03CR) 10Ottomata: [V: 032] Add prod DNS entried for kafka2003 Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320713 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1400). Please do the needful. [14:00:42] nothing to do [14:01:15] hashar: can I proceed then? 
[14:01:21] marostegui: yeah [14:01:24] \o/ [14:01:25] thanks [14:01:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320773 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [14:02:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1068 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320773 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [14:04:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1068 - T149079 (duration: 00m 48s) [14:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:31] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079 [14:05:01] (03PS11) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [14:05:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [14:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:08:39] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2785632 (10Jgreen) 05Open>03Resolved HPSA reports RAID1 OK. Thanks! [14:16:09] (03CR) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [14:16:18] (03PS2) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) [14:16:46] moritzm: you know stuff about imagemagick, right? do you remember when and why we upgraded to 6.8.9-9? 
is upgrading or downgrading soon possible? the current version is messing up CMYK thumbnails, newer and older are fine. https://phabricator.wikimedia.org/T141739 [14:18:46] (03CR) 10BBlack: [C: 031] cache_text: route esams back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320778 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:19:12] (03PS2) 10Ema: cache_text: route esams back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320778 (https://phabricator.wikimedia.org/T131503) [14:19:16] (03CR) 10Ema: [C: 032 V: 032] cache_text: route esams back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320778 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:20:23] !log cache_upload: upgrade nginx to 1.11.4-1+wmf14 [14:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] MatmaRex: 6.8.9.9 is the version shipped in Debian jessie, we didn't explicitly upgrade to that release. Ubuntu trusty uses 6.7.7.10 [14:31:59] !log cache_text: upgrade nginx to 1.11.4-1+wmf14 [14:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:51] (03PS1) 10BBlack: tlsproxy: separate ECDSA/RSA stapling for all [puppet] - 10https://gerrit.wikimedia.org/r/320783 [14:38:31] (03CR) 10BBlack: [C: 032] tlsproxy: separate ECDSA/RSA stapling for all [puppet] - 10https://gerrit.wikimedia.org/r/320783 (owner: 10BBlack) [14:39:06] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [14:39:16] (03CR) 10Hashar: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [14:40:08] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:40:28] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:41:22] ^marostegui, that is
probably you restarting some misc servers, right? [14:41:49] yes [14:41:50] that should be m5, I think [14:42:08] next time should I downtime dbproxy boxes? [14:42:12] no [14:42:22] I like them up so I can see if 2 down [14:42:33] which would be really bad :-) [14:43:22] but I had to ask in case it was a real failure [14:44:08] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [14:44:28] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [14:46:02] moritzm: so, downgrading to 6.7.7.10 would be an option? how about upgrading to 7.x? i don't think that's packaged for anything yet :/ [14:47:42] !log de-pooling mw1284 to raise mod_proxy_fcgi log level manually (temporary for an ongoing investigation) [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:47] MatmaRex: moritzm: ImageMagick 7 API changed a little bit, and should be carefully tested, see for example https://www.imagemagick.org/script/porting.php [14:50:02] Per https://phabricator.wikimedia.org/T141739#2785739 6.8.3 is the best candidate [14:51:18] MatmaRex: I've some issue on a migration to ImageMagick 7 with SVG and transparency for example. 
[14:51:54] so upgrading to 7 = a high risk of getting a lot of new issues [14:52:57] I don't think it's convenient to run different ImageMagick versions on several servers, by the way; a locally managed deb could make sense instead of relying on the OS-shipped version [14:53:14] Dereckson: hmm, yeah, that looks scary [14:53:24] we don't use imagemagick for svgs, in particular, but still [14:53:51] !log applying schema change on s3 (page) T69223 [14:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [14:54:12] we will see if we bring down production [14:55:19] * apergos raises an eyebrow [14:56:01] hey, I am just doing a schema change which "should be trivial and easy" [14:56:08] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:01] famouslastwords :-P [15:01:41] !log restored mw1284 to its settings [15:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:33] 06Operations, 06Commons, 06Multimedia: Deploy ImageMagick 6.8.3 from apt.wikimedia.org - https://phabricator.wikimedia.org/T150432#2785785 (10Dereckson) [15:04:56] 06Operations, 06Commons, 06Multimedia: Deploy ImageMagick 6.8.3 from apt.wikimedia.org - https://phabricator.wikimedia.org/T150432#2785801 (10Dereckson) >>! In T141739#2785767, @Dereckson wrote: > To summarize a discussion on wikimedia operations: > > - ImageMagick 7 API evolved, A [[ https://www.imagemag...
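Comparing the versions in play above (6.7.7.10 on trusty, 6.8.9-9 on jessie, 6.8.3 as the proposed candidate) is easiest as tuple comparison after parsing the `convert --version` banner. A sketch, with an illustrative banner string rather than real production output:

```python
import re

def im_version(banner):
    """Parse (major, minor, patch, build) from `convert --version` output."""
    m = re.search(r"ImageMagick (\d+)\.(\d+)\.(\d+)[-.](\d+)", banner)
    return tuple(int(x) for x in m.groups()) if m else None

jessie = im_version("Version: ImageMagick 6.8.9-9 Q16 x86_64")  # illustrative banner
print(jessie)              # → (6, 8, 9, 9)
print(jessie > (6, 8, 3))  # the jessie build is newer than the 6.8.3 candidate → True
```

The `[-.]` in the pattern accepts both separator styles seen in the discussion (Debian's "6.8.9-9" and the "6.8.9.9" / "6.7.7.10" spellings).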
[15:05:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:09:28] !log Deploy schema change s4 commonswiki.template links (db1068) - https://phabricator.wikimedia.org/T149079 [15:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:47] marostegui, maybe I can take care of T73563 at the same time that it is depooled? [15:11:47] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [15:12:04] jynus: Sure, the alter will take a long time, so it will be depooled until tomorrow [15:12:33] this is not fast, either, maybe even slower [15:12:40] no worries [15:12:51] I can extend the downtime tomorrow if needed [15:12:51] I want to know your opinion [15:13:03] so I do not make your change slower [15:13:11] jynus: No worries, it is fine :) [15:13:29] Ok, I will schedule it on deployments and run it now [15:13:43] cool [15:13:56] see how slow it is and then decide if we continue doing them together [15:14:59] ok [15:21:29] 06Operations, 07Puppet, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2785849 (10Halfak) p:05Normal>03Low [15:24:08] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:25:55] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2785852 (10Krinkle) @elukey @Ottomata Also see the syslog paste in the task description. The log ends with the exception that, I beli...
[15:27:10] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2785853 (10elukey) Yes I added the comment about the statsv source code line after checking that one :) [15:30:22] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Build calico - https://phabricator.wikimedia.org/T150434#2785879 (10Joe) [15:30:50] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build calico - https://phabricator.wikimedia.org/T150434#2785892 (10Joe) p:05Triage>03High a:03Joe [15:32:08] PROBLEM - NTP on db2010 is CRITICAL: NTP CRITICAL: Offset unknown [15:32:12] (03PS1) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [15:36:57] (03PS12) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [15:40:08] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:08] RECOVERY - NTP on db2010 is OK: NTP OK: Offset 0.0002968013287 secs [15:44:21] (03PS13) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [15:45:25] (03PS1) 10Muehlenhoff: Update to 1.1.0c and drop merged fix-read-ahead.patch [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320789 [15:49:09] marostegui: hello let me know when you are ready [15:49:11] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2785982 (10Gehel) `kartotherian.req.s*` metrics are deleted [15:49:22] papaul: Let me power the server off for you then :) [15:50:13] papaul: Server going down now. 
Did you check the disks to make sure we are good to go? https://phabricator.wikimedia.org/T143874#2783159 [15:50:35] marostegui:all 12 disks are 2T [15:51:56] papaul: excellent [15:52:33] 06Operations, 10DBA, 07Upstream: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#2785995 (10Marostegui) [15:53:19] papaul: Server should be down now I believe as per the ILO Server Power: Off [15:53:41] (03PS2) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [15:53:43] (03PS5) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) [15:53:44] marostegui: thanks [15:53:45] (03PS1) 10Andrew Bogott: Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) [15:57:07] (03PS2) 10Muehlenhoff: Update to 1.1.0c and drop two merged patches [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320789 [16:03:08] (03PS3) 10Muehlenhoff: Update to 1.1.0c and drop two merged patches [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320789 [16:03:12] (03PS2) 10BBlack: VCL: only retry 503 for idempotent methods [puppet] - 10https://gerrit.wikimedia.org/r/320760 [16:03:19] (03CR) 10BBlack: [C: 032 V: 032] VCL: only retry 503 for idempotent methods [puppet] - 10https://gerrit.wikimedia.org/r/320760 (owner: 10BBlack) [16:04:42] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2786035 (10fgiunchedi) 05Open>03Resolved This is complete, aggregation method changed everywhere. 
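The pre-reimage disk check discussed above ("all 12 disks are 2T") is the kind of thing that can be scripted rather than eyeballed. A minimal sketch, assuming `lsblk -b -d -n -o NAME,SIZE`-style output; the helper name and sample data are illustrative, not part of any WMF tooling:

```python
# Hypothetical pre-reimage sanity check: confirm the host presents the
# expected number of data disks at (roughly) the expected size.
TB = 1000 ** 4  # drive vendors use decimal units, so "2T" is about 2e12 bytes

def check_disks(lsblk_output, expected_count, expected_bytes, tolerance=0.05):
    """True if exactly expected_count disks fall within tolerance of expected_bytes."""
    disks = [line.split() for line in lsblk_output.strip().splitlines()]
    matching = [
        name for name, size in disks
        if abs(int(size) - expected_bytes) <= tolerance * expected_bytes
    ]
    return len(matching) == expected_count

# Sample shaped like `lsblk -b -d -n -o NAME,SIZE` on a 12-disk host.
sample = "\n".join("sd%s 2000398934016" % c for c in "abcdefghijkl")
print(check_disks(sample, expected_count=12, expected_bytes=2 * TB))  # True
```

A check like this could gate the reimage script, refusing to power-cycle a host whose disk layout does not match what the ticket expects.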
[16:05:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:08:22] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:09:08] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:09:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:09:56] grrrit-wm: nick [16:10:00] grrrit-wm1: nick [16:10:08] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2786070 (10Gehel) I'm learning things... The "rate" of a timer for statsite is `timer_sum(&t->tm) / GLOBAL_CONFIG->flush_inter... [16:10:14] grrrit-wm1: nick [16:10:39] grrrit-wm1: restart [16:10:41] re-connecting to gerrit [16:10:42] reconnected to gerrit [16:11:06] grrrit-wm1: nick [16:11:20] grrrit-wm1: force-restart [16:11:21] re-connecting to gerrit and irc. [16:12:02] re-connected to gerrit and irc. [16:12:22] mutante ^^ first real test when it changes its nick, restarted it without having to ssh in and manually restart it :) [16:24:15] such chatty grrrit-wm [16:26:12] (03PS3) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [16:26:14] (03PS2) 10Andrew Bogott: Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) [16:26:16] (03PS6) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users.
[puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) [16:27:52] (03CR) 10jenkins-bot: [V: 04-1] Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [16:27:55] (03CR) 10jenkins-bot: [V: 04-1] Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [16:28:01] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [16:31:05] (03PS4) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [16:31:07] (03PS3) 10Andrew Bogott: Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) [16:31:09] (03PS7) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) [16:32:41] (03PS5) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [16:32:43] (03PS4) 10Andrew Bogott: Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) [16:32:46] (03PS8) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. [puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) [16:32:48] (03CR) 10jenkins-bot: [V: 04-1] Check password/ip whitelist for wmtotp. 
[puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [16:33:43] (03PS3) 1020after4: Move config for git-ssh(phabricator) to hiera [puppet] - 10https://gerrit.wikimedia.org/r/318662 (https://phabricator.wikimedia.org/T143363) [16:34:53] (03CR) 1020after4: Move config for git-ssh(phabricator) to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318662 (https://phabricator.wikimedia.org/T143363) (owner: 1020after4) [16:36:08] marostegui: all disks are in place [16:36:18] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:36:20] papaul: ok, going to launch the reimage script [16:37:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [16:38:19] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2753248 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto... [16:39:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:45:40] (03PS1) 10Alexandros Kosiaris: Introduce a system wide systemd check [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) [16:46:47] Pchelolo: so, I'm looking at the node-mapnik stuff [16:47:10] it looks like node-mapnik requires g++ >= 5, including a new enough libstdc++ [16:47:23] I assume that's the latest node-mapnik, not that the one that we run [16:47:36] but I also assume that the one that we run is not compatible with Node.js 6? 
[16:47:36] paravoid: ye, as I understand that's why yurik is using the 3.5.13 version [16:48:01] (03CR) 10Filippo Giunchedi: [C: 032] Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles) [16:48:06] (03PS6) 10Filippo Giunchedi: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles) [16:48:16] right, so for Node.js 6 we need a new-enough node-mapnik (which version?) and for that we need g++ 5 [16:49:13] paravoid: the issue is that they use node-pre-gyp - it's a tool that lets you download precompiled binaries from amazon instead of building your own [16:49:30] * paravoid closes ears and yells lalalalala [16:49:39] and for node 6 + node-mapnik 3.5.13 that was never uploaded by node-mapnik maintainers [16:50:26] so we can upgrade to newer g++, node-mapnik 3.5.14 or set up our own build of node-mapnik [16:50:40] for the latter we need libmapnik3-dev that's only in stretch, not in jessie [16:50:51] and the build process itself is pretty involved [16:52:00] paravoid: I think I should create a task and list all that I've gathered about this, it will be easier to explore our options once we have all the info in one place. I'll create it as soon as I get to the office [16:52:29] we need to either a) build node-mapnik 3.5.13 for node 6, or b) build node-mapnik 3.5.14 for node 6 + jessie [16:52:38] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786211 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto... [16:52:42] either way, we can't get out of this without building node-mapnik, no?
paravoid: we can if we upgrade to 3.5.14 and newer stdlib, we will be able to use the precompiled builds from node-pre-gyp then, but yurik had some other problems with the newer version I think. I'm not quite sure [16:54:07] we can't really use the newer libstdc++ on a jessie system [16:54:13] it's going to be an insane amount of work to do that [16:54:28] bummer :( [16:54:31] i was kinda hoping for the new stdlib [16:54:42] isn't there any way to have them side by side? [16:54:53] i meant libc [16:54:56] we'll need to backport gcc 5+ to jessie [16:54:59] no you mean libstdc++ [16:55:37] it's not just a library, it's part of the toolchain [16:55:40] so, no [16:56:16] sigh. Ok, in that case we really need to figure out how to package it - either by figuring out how to build mapnik from scratch (ideal), or just repackage it in a more controlled way. Thing is, we can totally use .14 - if only we could compile it with the libstdc++ [16:56:39] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786229 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto... [16:56:44] uh? [16:57:21] paravoid, best scenario - simply build node-mapnik + mapnik (c lib) from source [16:57:45] which version and with which compiler?
My understanding is that it uses gcc [16:58:46] the latest version uses C++11, which relies on gcc 5 [16:58:53] !log T133395: Performing next 25 RESTBase table conversions to TWCS [16:58:54] hence the requirement for a newer libstdc++ [16:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:01] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [16:59:09] i think it does not have to use the latest compiler - it was just a preference [16:59:12] i might be mistaken of course [17:00:01] again, i shouldn't speak about what i don't know - i remember building it caused a lot of problems because it required some packages to be installed, that were not explicit [17:00:05] hashar, thcipriani, and mutante: Dear anthropoid, the time has come. Please deploy CI Migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1700). [17:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1700). Please do the needful. [17:00:20] Judging by their travis config the latest should also be compilable with clang [17:00:29] the best person to speak about it is pnorman [17:00:38] he's in the interactive channel [17:00:41] our new contractor [17:00:41] yes, I think you're mistaken [17:00:48] they are clearly mentioning C++11 [17:01:18] ehhh no CI migration today :) /me updates deployment page [17:02:14] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786255 (10Marostegui) The reason for so many script runs is that the server doesn't get reimaged if it not on. ``` Set Chassis Power Control to Cycle failed: Command not supported in present sta... [17:02:58] paravoid, even building older version from scratch would be great - no need to download and ensure that there is a node6 version available. 
Btw, there are many people who complain about how hard it is to build mapnik [17:03:14] ye, hm, in travis they compile with clang-3.8 which is available in jessie-backports [17:03:30] ha, nice hack [17:03:32] in any case, pnorman is the person who knows the most about this [17:03:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:03:46] he's on the west coast [17:04:38] (03PS1) 10Alexandros Kosiaris: profile::docker::builder: Make the hiera calls parameters [puppet] - 10https://gerrit.wikimedia.org/r/320794 [17:05:24] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786294 (10Volans) The issue with `wmf-reimage` is tracked in T150448 [17:05:27] (03PS3) 10Andrew Bogott: Add wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/319909 [17:07:12] (03CR) 10Andrew Bogott: [C: 032] Add wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/319909 (owner: 10Andrew Bogott) [17:07:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:09:27] jynus: some user tried to move a page on de.wikipedia, and got three times a db error, I checked logstash it's Lock wait timeout exceeded; try restarting transaction, do you need to be pinged about such user facing db failures or do you have monitoring of such problems? 
[17:10:08] I have monitoring [17:10:48] but if the error happens every time it should be reported - but not to me, to a mediawiki-general-or-unknown bug [17:11:29] here, it failed twice, then worked the third time [17:11:42] those are normally performance problems with infrequent queries - devs will contact me if they need help [17:12:13] to some extent it is normal for things to fail once, then work the next time [17:17:53] logstash contains 33 occurrences of a slow query or a failure in 24h [17:18:05] for this query [17:18:39] strange, it's always on de. [17:20:58] Dereckson, if you want to open a ticket, please do - but if anything, that looks like a mediawiki bug [17:21:46] * Dereckson nods [17:21:57] I am not saying I will not help [17:22:17] I am saying a mediawiki expert needs to see it first and triage [17:22:19] yes I understand DBA care isn't magic that makes a query fast when there isn't any index in the code [17:22:43] and code optimization could be required too [17:22:50] (03CR) 10Thcipriani: "One other nit" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [17:23:01] then they can ask for my opinion on optimization or if they see something wrong in the configuration, etc. [17:23:03] I'm filing that [17:23:10] thank you, Dereckson [17:25:45] (03CR) 10Chad: `scap patch` tool for applying patches to a wmf/branch (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [17:28:31] (03CR) 1020after4: `scap patch` tool for applying patches to a wmf/branch (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [17:28:49] 06Operations, 06Performance-Team, 10Thumbor: Make Thumbor IM engine based on a subprocess - https://phabricator.wikimedia.org/T149903#2786361 (10fgiunchedi) So in {T149985} I had to revert `MAGICK_TIME_LIMIT` environment variable because it was killing python/thumbor process as a whole, looks like forking or...
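The behaviour jynus describes above, where a page move hits "Lock wait timeout exceeded" once or twice and then succeeds, is typically handled with a bounded retry around the transaction. A minimal sketch; `LockWaitTimeout`, `with_retries`, and `flaky_move` are illustrative stand-ins, not MediaWiki's actual error classes or API:

```python
class LockWaitTimeout(Exception):
    """Stand-in for a transient driver error such as MySQL errno 1205."""

def with_retries(run_txn, attempts=3):
    """Run run_txn() up to `attempts` times, re-raising only if every
    attempt hits the transient error."""
    for i in range(attempts):
        try:
            return run_txn()
        except LockWaitTimeout:
            if i == attempts - 1:
                raise
            # Real code would sleep/back off here before retrying.

calls = {"n": 0}

def flaky_move():
    # Fails twice, then works the third time, like the page move in the log.
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockWaitTimeout()
    return "moved"

print(with_retries(flaky_move))  # moved
```

The bound matters: retrying forever would just pile more waiters onto an already contended row lock, which is why errors that recur every time belong in a bug report rather than a retry loop.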
[17:30:48] 06Operations, 10ops-codfw: rack spare pool servers and update tracking sheet - https://phabricator.wikimedia.org/T150341#2786396 (10RobH) [17:36:08] (03CR) 10Chad: `scap patch` tool for applying patches to a wmf/branch (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [17:39:01] (03PS4) 10Filippo Giunchedi: site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) [17:42:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 1.1.0c and drop two merged patches [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320789 (owner: 10Muehlenhoff) [17:44:48] (03PS1) 10Papaul: Fix mgmt DNS entries for kafka2003 and 3 spare pool servers the first mgmt IP's used were not in the config file but are responding to ping and pointing to exsting server (db2021,db2022..) Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320803 (https://phabricator.wikimedia.org/T150340) [17:46:09] !log T133395: Convert final 25 RESTBase tables to TWCS [17:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:15] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [17:46:54] 06Operations, 10puppet-compiler: puppet compiler fails with modules using puppetdb - https://phabricator.wikimedia.org/T150456#2786449 (10fgiunchedi) [17:48:14] (03PS1) 10Ladsgroup: labs: add ores_classification and ores_model tables [puppet] - 10https://gerrit.wikimedia.org/r/320804 [17:49:01] (03PS2) 10Ladsgroup: labs: add ores_classification and ores_model tables [puppet] - 10https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) [17:51:17] (03PS1) 10Muehlenhoff: Fix build failure by updating d2i-tests.tar [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320808 [17:52:46] (03CR) 10Marostegui: [C: 031] "I will try to find out what is going on with those two hosts." 
[dns] - 10https://gerrit.wikimedia.org/r/320803 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [17:55:14] (03PS1) 10Papaul: Fix MAC for kafka2003 Bug: T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320809 (https://phabricator.wikimedia.org/T150340) [17:55:31] (03CR) 10Jcrespo: [C: 031] "Ok with that, but needs Chase approval." [puppet] - 10https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) (owner: 10Ladsgroup) [17:56:00] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 659 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3035388 keys, up 10 days 9 hours - replication_delay is 659 [17:57:46] so the mediawiki fatals may be redis-related? https://logstash.wikimedia.org/goto/93224bb5449a3155b0035eb256105248 [17:58:11] (03PS2) 10RobH: Fix MAC for kafka2003 Bug: T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320809 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [17:58:55] (03CR) 10RobH: [C: 032] Fix MAC for kafka2003 Bug: T150340 [puppet] - 10https://gerrit.wikimedia.org/r/320809 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1800). Please do the needful. [18:00:38] No ORES for today [18:03:33] no parsoid today [18:03:35] nope [18:03:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:03:52] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2786530 (10Cmjohnson) Updated to my comment about the ssd's fitting for the aqs systems. I attached the ssd with the SFF adapter to the disk caddy for the R720XD and the s... 
[18:05:04] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2786534 (10RobH) Please note that the 3 R720xd systems on the sub-task, they won't fit SSDs into the LFF hot swap slots. So t... [18:06:24] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786548 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto... [18:06:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:06:52] (03PS5) 10Filippo Giunchedi: site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) [18:07:00] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3026890 keys, up 10 days 9 hours - replication_delay is 0 [18:11:10] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=78%) [18:11:12] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2786567 (10RobH) a:05RobH>03Eevans [18:15:11] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marker metrics - https://phabricator.wikimedia.org/T150353#2786576 (10Gehel) Note: metrics needs to be cleaned from both graphite1001.eqiad.wmnet and graphite2001.codfw.wmnet [18:19:14] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:19:34] (03PS1) 10Madhuvishy: labstore: Update check_drbd_role script to check role of the given host only [puppet] - 10https://gerrit.wikimedia.org/r/320813 [18:20:10] RECOVERY - Disk space on analytics1027 is OK: DISK OK [18:20:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix build failure by updating d2i-tests.tar [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320808 (owner: 10Muehlenhoff) [18:21:21] (03CR) 10Madhuvishy: [C: 032] labstore: Update check_drbd_role script to check role of the given host only [puppet] - 10https://gerrit.wikimedia.org/r/320813 (owner: 10Madhuvishy) [18:21:28] (03PS2) 10Madhuvishy: labstore: Update check_drbd_role script to check role of the given host only [puppet] - 10https://gerrit.wikimedia.org/r/320813 [18:21:34] (03CR) 10Madhuvishy: [V: 032] labstore: Update check_drbd_role script to check role of the given host only [puppet] - 10https://gerrit.wikimedia.org/r/320813 (owner: 10Madhuvishy) [18:22:02] (03CR) 10Filippo Giunchedi: "LGTM generally, a couple of comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris) [18:23:49] (03CR) 10Volans: "Nice, a couple of comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris) [18:24:07] 06Operations, 06Discovery, 06Maps: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter - https://phabricator.wikimedia.org/T150460#2786600 (10Gehel) [18:24:38] (03PS1) 10Muehlenhoff: Cope with new libssl1.1 symbols introduced in 1.1.0c [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320814 [18:28:05] jouncebot next [18:28:05] In 0 hour(s) and 31 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1900) [18:28:21] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2786655 (10bearND) Another idea I brought up with GWicke on IRC yesterday is to embed the original dimensions in the URL (either as query parameters or... [18:29:44] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2786657 (10bearND) > look into which users rely on the current thumb format The apps and MCS rely on the current thumb format to resize thumbnails dow... 
[18:30:15] (03PS6) 10Filippo Giunchedi: site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) [18:31:30] 06Operations: Update legal-tm-vio@ alias - https://phabricator.wikimedia.org/T150463#2786668 (10Slaporte) [18:31:57] (03CR) 10Volans: "If merging this as-is with all together there is a race condition that a host will get the new prometheus remote host before Prometheus is" [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [18:33:52] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2786684 (10Cmjohnson) p:05Triage>03Normal [18:35:03] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786704 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbstore2001.codfw.wmnet'] ``` and were **ALL** successful. [18:35:55] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [18:35:56] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2786705 (10Cmjohnson) Received the new cables and finished with the row redundancy less the d1 to d8 link which will need fiber. [18:36:00] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320450 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [18:37:51] (03CR) 10Filippo Giunchedi: "@Volans no race condition as it won't error on the end host or puppet. 
Polls from Prometheus server will fail until the next 20/30 min unt" [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [18:39:00] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:01] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:01] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:01] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:01] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:01] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:10] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:10] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:11] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:11] PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:20] PROBLEM - puppet last run on dbproxy1007 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:20] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:23] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786716 (10Marostegui) The server got reinstalled and looks good: ``` root@dbstore2001:~# lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 8.6... [18:39:29] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2786717 (10Cmjohnson) a:05Cmjohnson>03RobH [18:39:30] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:34] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2786719 (10Marostegui) a:03Marostegui [18:39:40] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:40] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:40] PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [18:39:40] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [18:39:40] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:41] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:41] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:42] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:50] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:50] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:50] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:39:50] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [18:40:00] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:00] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [18:40:01] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:01] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:10] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:10] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:11] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:40:30] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:41:41] uhhhh ? [18:42:07] I'm guessing some apt update failage? [18:42:10] PROBLEM - NTP on etcd1003 is CRITICAL: NTP CRITICAL: Offset -1.092672318 secs [18:42:24] yeah, I restarted nginx on sodium [18:42:26] sorry [18:42:37] !log upgrading (and restarting) nginx on sodium [18:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:34] we really need to get rid of these silly latest constraints in standard_packages... [18:44:57] yes we do [18:45:11] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2786745 (10GWicke) @robh, we need SSDs in these boxes. Could we use generic 2.5"->3.5" adapters to fit 2.5" SSDs in 3.5" slots? [18:46:18] (03CR) 10Filippo Giunchedi: [C: 032] site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [18:48:10] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:48:21] 06Operations, 06Discovery, 06Maps: publish kartotherian / tilerator metrics by cluster - https://phabricator.wikimedia.org/T150466#2786766 (10Gehel) [18:49:09] (03PS2) 10Papaul: Fix mgmt DNS entries for kafka2003 and 3 spare pool servers the first mgmt IP's used were not in the config file but are responding to ping and pointing to exsting server (db2021,db2022..) 
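The "silly latest constraints in standard_packages" complaint above refers to Puppet package resources declared with `ensure => latest`: every agent run then re-queries apt, so a transient repository outage (here, the nginx restart on sodium) fails the catalog and fires a puppet CRITICAL on every host mid-run. A minimal sketch of the difference, with illustrative resource titles rather than the actual standard_packages manifest:

```puppet
# Before (fragile): every agent run checks apt for a newer version,
# so a brief repo outage -- like the nginx restart on sodium --
# turns into catalog failures across the fleet:
#
#   package { ['tcpdump', 'tshark', 'tmux']:
#     ensure => latest,
#   }

# After (robust): install once and leave upgrades to a controlled
# rollout; a momentary repo outage no longer breaks the run.
package { ['tcpdump', 'tshark', 'tmux']:
  ensure => present,
}
```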
Bug:T150340 [dns] - 10https://gerrit.wikimedia.org/r/320803 (https://phabricator.wikimedia.org/T150340) [18:52:30] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 55 seconds ago with 5 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/targets/mysql-labs_ulsfo.yaml],File[/srv/prometheus/ops/targets/mysql-misc_ulsfo.yaml],File[/srv/prometheus/ops/targets/mysql-parsercache_ulsfo.yaml],File[/srv/prometheus/ops/targets/mysql-dbstore_ulsfo.yaml] [18:54:51] why did that fail? [18:55:02] missing dirs? [18:55:22] no [18:55:24] missing files [18:55:25] missing files [18:55:27] yeah [18:55:31] they need to be created empty [18:55:49] 06Operations, 13Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2786807 (10Cmjohnson) [18:55:49] (remember I initially checked if they existed) [18:55:51] 06Operations, 10ops-eqiad: Rename potassium / WMF3287 as poolcounter1002 - https://phabricator.wikimedia.org/T149106#2786805 (10Cmjohnson) 05Open>03Resolved Updated racktables and switch description [18:56:11] and I think you told me to assume it (or somone else did :-)) [18:56:11] either that or special case for sites where mysql exists, probably easier to go for empty now [18:56:14] (03PS1) 10Andrew Bogott: Fix the private puppet hooks to point to where the repo actually is [puppet] - 10https://gerrit.wikimedia.org/r/320818 [18:57:01] likely, I'll followup [18:57:25] !log uploaded openssl 1.1.0c for jessie-wikimedia to carbon [18:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:53] (03CR) 10Dzahn: [C: 031] Fix the private puppet hooks to point to where the repo actually is [puppet] - 10https://gerrit.wikimedia.org/r/320818 (owner: 10Andrew Bogott) [18:58:15] yeah, I think I initially sent a patch with [ original file, /dev/null ], I think [18:58:17] (03PS2) 10Andrew Bogott: Fix the private puppet hooks to point to where the repo actually 
is [puppet] - 10https://gerrit.wikimedia.org/r/320818 [18:58:54] (03PS1) 10Filippo Giunchedi: standard: install prometheus node_exporter on every site [puppet] - 10https://gerrit.wikimedia.org/r/320819 [18:58:58] and we said no because they had to be generated anyway [18:59:26] I will do that at some point [18:59:27] indeed [19:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T1900). Please do the needful. [19:00:05] tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:17] (03CR) 10Andrew Bogott: [C: 032] Fix the private puppet hooks to point to where the repo actually is [puppet] - 10https://gerrit.wikimedia.org/r/320818 (owner: 10Andrew Bogott) [19:00:47] (03CR) 10Filippo Giunchedi: [C: 032] standard: install prometheus node_exporter on every site [puppet] - 10https://gerrit.wikimedia.org/r/320819 (owner: 10Filippo Giunchedi) [19:00:50] (03PS2) 10Filippo Giunchedi: standard: install prometheus node_exporter on every site [puppet] - 10https://gerrit.wikimedia.org/r/320819 [19:00:55] o/ [19:01:01] I can SWAT today [19:02:23] (03PS2) 10Thcipriani: Make beta PageViewInfo use the production pageview API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320450 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:02:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320450 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:02:32] (03CR) 10Volans: "LGTM, looking forward to see some test too ;)" (031 comment) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [19:03:18] (03Merged) 10jenkins-bot: Make beta PageViewInfo use the 
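The empty-placeholder approach agreed on above (and implemented shortly after in https://gerrit.wikimedia.org/r/320823) amounts to shipping empty target files so the Prometheus server role compiles even in sites with no MySQL hosts yet. A hedged sketch, with paths mirroring the failed resources reported on bast4001; the variable and loop are illustrative, not the actual module code:

```puppet
# Placeholder Prometheus target files for sites (ulsfo/esams)
# that have no MySQL hosts yet; created empty so File resources
# such as mysql-labs_ulsfo.yaml stop failing on the server role.
$mysql_jobs = ['mysql-labs', 'mysql-misc', 'mysql-parsercache', 'mysql-dbstore']

$mysql_jobs.each |String $job| {
  file { "/srv/prometheus/ops/targets/${job}_ulsfo.yaml":
    ensure  => file,
    content => '',      # empty until real targets are generated
    replace => false,   # don't clobber a file later filled with targets
  }
}
```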
production pageview API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320450 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:03:56] (03PS3) 10Thcipriani: Add PageViewInfo log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:04:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:04:21] tgr: your labs change should go out with the next round of beta-code-update-eqiad/beta-scap-eqiad [19:04:25] (03CR) 10Filippo Giunchedi: [V: 032] standard: install prometheus node_exporter on every site [puppet] - 10https://gerrit.wikimedia.org/r/320819 (owner: 10Filippo Giunchedi) [19:04:35] (03Merged) 10jenkins-bot: Add PageViewInfo log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) (owner: 10Gergő Tisza) [19:05:45] tgr: PageViewInfo log channel is live on mw1099, check please [19:06:27] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: labstore1003 - RAID fail - https://phabricator.wikimedia.org/T149156#2743689 (10Cmjohnson) Swapped failed disk [19:06:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:06:40] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:07:00] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:07:00] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:07:00] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:07:00] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is 
currently enabled, last run 7 seconds ago with 0 failures [19:07:10] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:07:10] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:07:10] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:07:10] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:07:10] RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:07:11] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:07:20] RECOVERY - puppet last run on dbproxy1007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:07:20] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:07:30] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:07:40] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:07:40] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:07:40] RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:07:40] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:07:40] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:07:41] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures 
[19:07:41] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:07:45] thcipriani: PageViewInfo is beta-only but I verified that nothing blows up [19:07:50] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:07:50] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:07:50] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:07:50] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:08:00] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:08:00] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:08:00] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:08:00] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:08:01] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:08:01] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:08:10] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:08:15] tgr: great, thanks, will go live after a sync to ensure masters both have the labs change. [19:08:27] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2786855 (10RobH) >>! 
In T136340#2786745, @GWicke wrote: > @robh, we need SSDs in these boxes. Could we use generic 2.5"->3.5"... [19:08:30] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:09:21] (03PS6) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829) [19:09:23] (03PS1) 10Jcrespo: Enable unix socket authentication everywhere [puppet] - 10https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446) [19:09:25] (03CR) 10RobH: [C: 032] Fix mgmt DNS entries for kafka2003 and 3 spare pool servers the first mgmt IP's used were not in the config file but are responding to ping [dns] - 10https://gerrit.wikimedia.org/r/320803 (https://phabricator.wikimedia.org/T150340) (owner: 10Papaul) [19:09:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:320450|Make beta PageViewInfo use the production pageview API (T129602)]] (labs-only-change) (duration: 00m 48s) [19:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:58] T129602: Deploy WikimediaPageViewInfo extension to beta cluster - https://phabricator.wikimedia.org/T129602 [19:10:10] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (10Cmjohnson) -Confirmed power supply is not working, reseated and still not working. HP support request needs to be submitted. 
[19:12:10] RECOVERY - NTP on etcd1003 is OK: NTP OK: Offset -0.009302705526 secs [19:12:13] (03PS1) 10Filippo Giunchedi: role: add placeholder files for prometheus/mysql in ulsfo/esams [puppet] - 10https://gerrit.wikimedia.org/r/320823 [19:12:39] jynus: ^ [19:13:45] (03PS7) 10Jcrespo: mariadb-labs: Prepare db1095 to be the new sanitarium host [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829) [19:14:13] (03CR) 10Jcrespo: [C: 031] role: add placeholder files for prometheus/mysql in ulsfo/esams [puppet] - 10https://gerrit.wikimedia.org/r/320823 (owner: 10Filippo Giunchedi) [19:14:19] (03PS1) 10Andrew Bogott: Keystone: remove explicit observer rights [puppet] - 10https://gerrit.wikimedia.org/r/320825 (https://phabricator.wikimedia.org/T150092) [19:14:21] (03PS1) 10Andrew Bogott: Keystone: Make the project list public [puppet] - 10https://gerrit.wikimedia.org/r/320826 (https://phabricator.wikimedia.org/T150092) [19:14:24] (03PS1) 10Andrew Bogott: Make compute:get fully public [puppet] - 10https://gerrit.wikimedia.org/r/320827 (https://phabricator.wikimedia.org/T150092) [19:14:37] deploy, we will solve it better later [19:15:13] when I say we, I mean I, I intend to work on that [19:15:19] but not now [19:15:44] (03PS2) 10Filippo Giunchedi: role: add placeholder files for prometheus/mysql in ulsfo/esams [puppet] - 10https://gerrit.wikimedia.org/r/320823 [19:15:55] indeed [19:16:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] role: add placeholder files for prometheus/mysql in ulsfo/esams [puppet] - 10https://gerrit.wikimedia.org/r/320823 (owner: 10Filippo Giunchedi) [19:16:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:320552|Add PageViewInfo log channel (T129602)]] (duration: 00m 49s) [19:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:59] T129602: Deploy WikimediaPageViewInfo extension to beta cluster - https://phabricator.wikimedia.org/T129602 
[19:17:05] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: lutetium RAID disk failed - https://phabricator.wikimedia.org/T149904#2786912 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [19:17:28] tgr: changes are live on cluster, will go out on beta shortly [19:17:41] thanks thcipriani! [19:19:29] !log update RESTBase to 6bfa0f75f - staging [19:19:30] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:23] 06Operations, 10RESTBase, 10Traffic, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2786916 (10Pchelolo) 05Open>03Resolved Merged and deployed. Resolving. Hopefully we've got all of the CORS edge cases now. [19:20:27] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2786920 (10GWicke) Is this a general limitation, or some limitation of the particular adapter used? We'd save a lot of money a... [19:22:24] !log upgrading openssl to 1.1.0c on cache_* [19:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:58] (03PS6) 10Andrew Bogott: Keystone: open up firewall for public keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [19:23:00] (03PS5) 10Andrew Bogott: Check password/ip whitelist for wmtotp. [puppet] - 10https://gerrit.wikimedia.org/r/320791 (https://phabricator.wikimedia.org/T150092) [19:23:02] (03PS9) 10Andrew Bogott: Keystone: Limit password auth to certain hosts and users. 
[puppet] - 10https://gerrit.wikimedia.org/r/320706 (https://phabricator.wikimedia.org/T150092) [19:23:04] (03PS1) 10Andrew Bogott: Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) [19:24:32] !log update RESTBase to 6bfa0f75f - canary on restbase1007 [19:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:03] !log update RESTBase to 6bfa0f75f [19:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:05] !log upgrading libssl1.1 to 1.1.0c on other misc hosts... [19:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:12] (03PS2) 10Ppchelko: RESTBase config: Use special project for wikidata domains. [puppet] - 10https://gerrit.wikimedia.org/r/320529 [19:32:19] (03PS1) 10Filippo Giunchedi: templates: add prometheus.svc for ulsfo/esams [dns] - 10https://gerrit.wikimedia.org/r/320831 [19:33:12] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2787049 (10AlexMonk-WMF) [19:33:55] !log cache_*: restarting nginx for libssl update (seamless) [19:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:49] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2787067 (10jcrespo) [19:36:17] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787072 (10fgiunchedi) [19:36:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:37:31] !log elastic@eqiad: reindexing commonswiki (logs in terbium.eqiad.wmnet:~dcausse/commons_reindex/cirrus_log) - T150232 [19:37:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:37] T150232: CirrusSearch: normal word get obscure used as keyword (namespace) - https://phabricator.wikimedia.org/T150232 [19:39:30] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787097 (10fgiunchedi) Also the same happens on all esams misc frontend, needs further investigation [19:48:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:50:14] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2787116 (10RobH) That particular adapter may work, I'd advise we purchase just one and test with spare non s3610 SSDs that are... [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T2000). Please do the needful. [20:01:35] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2787187 (10AlexMonk-WMF) [20:02:19] (03PS1) 10Yurik: LABS: enable mapframe everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320835 [20:04:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:08:23] 06Operations, 06Performance-Team, 10scap, 07Epic: During deployment old servers may populate new cache URIs - https://phabricator.wikimedia.org/T47877#2787205 (10Krinkle) 05Open>03Resolved a:03Krinkle [20:08:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:08:54] grrrit-wm: nick [20:11:21] thcipriani, are you doing the train? 
[20:12:19] yurik: nope twentyafterfour is on train duty today: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T2000 [20:13:07] ah,sorry. i looked and then forgot :( short term memory lapse :) [20:13:25] yurik: what's up? [20:13:34] twentyafterfour, if you have a sec, could you also do https://gerrit.wikimedia.org/r/320835 [20:13:38] its a labs config change [20:13:43] ok [20:13:47] thx :) [20:15:10] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:15:58] (03CR) 10Filippo Giunchedi: [C: 032] templates: add prometheus.svc for ulsfo/esams [dns] - 10https://gerrit.wikimedia.org/r/320831 (owner: 10Filippo Giunchedi) [20:18:56] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2787229 (10Gilles) a:03Gilles [20:19:47] (03CR) 1020after4: [C: 032] LABS: enable mapframe everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320835 (owner: 10Yurik) [20:20:20] (03Merged) 10jenkins-bot: LABS: enable mapframe everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320835 (owner: 10Yurik) [20:21:11] yurik: can you test when I sync it? [20:22:13] twentyafterfour, its a labs config, i think it only gets there with the next autopull [20:22:23] oh [20:22:41] so it doesn't need a scap? [20:23:01] twentyafterfour ill ask -labs for you if you wish? 
[20:23:50] twentyafterfour, it does - because otherwise some magical script will complain
[20:24:03] that servers don't have the same state as git master
[20:24:17] right
[20:24:27] so just sync-file it
[20:24:29] yep
[20:24:34] thanks :)
[20:25:12] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings-labs.php: sync LABS: enable mapframe everywhere (I35709ed2903b28a2d4d6e8528ac1fcf361483e76) (duration: 00m 50s)
[20:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:18] !log neon - deactivate puppet node
[20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:49] so should we go ahead with the train even with the recurring "MediaWiki exceptions and fatals per minute on graphite1001" alerts?
[20:29:08] the alert is apparently caused by wikitech which is already on wmf.2
[20:32:44] !log neon - shutdown -h now (scheduled 3 days downtime, nothing that looked worth saving in homes)
[20:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:55] !log neon - deactivated puppet node, scheduled icinga downtime, shutdown server permanently (T125023)
[20:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:01] T125023: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023
[20:40:51] I guess I'm going ahead with it
[20:43:10] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[20:44:22] (03PS7) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013
[20:45:21] (03PS1) 1020after4: all wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320838
[20:45:23] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320838 (owner: 1020after4)
[20:45:55] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320838 (owner: 1020after4)
[20:45:58] 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2787320 (10thcipriani)
[20:47:27] jouncebot next
[20:47:27] In 3 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161111T0000)
[20:49:19] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.2
[20:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:34] Fatal error: Call to undefined method ApiQueryContributions::isInGeneratorMode() in /srv/mediawiki/php-1.29.0-wmf.2/extensions/ORES/includes/ApiHooks.php on line 172
[20:50:46] !
[20:51:04] Amir1, ^
[20:51:18] (03PS1) 1020after4: Revert "all wikis to 1.29.0-wmf.2" (Fatal errors spike) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320840
[20:51:27] (03CR) 1020after4: [C: 032] Revert "all wikis to 1.29.0-wmf.2" (Fatal errors spike) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320840 (owner: 1020after4)
[20:51:56] (03Merged) 10jenkins-bot: Revert "all wikis to 1.29.0-wmf.2" (Fatal errors spike) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320840 (owner: 1020after4)
[20:52:04] hey twentyafterfour. I just logged on and got pinged with your message because it contains "ores"
[20:52:06] What's up?
[20:52:27] Looks like we're deploying ".2" but there
[20:52:31] 's an issue in the ORES ext?
[20:52:45] yeah I just deployed wmf.2 and got a spike of errors
[20:53:02] undefined method ApiQueryContributions::isInGeneratorMode() in ApiHooks.php:172
[20:53:10] OK. I'm not very familiar with the MW code. trying to raise Amir1
[20:53:15] looking
[20:53:26] it's not a total emergency I'm rolling back the deployment
[20:53:30] kk
[20:53:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[20:53:57] (03CR) 10Faidon Liambotis: [C: 04-1] Introduce a system wide systemd check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris)
[20:54:22] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert "all wikis to 1.29.0-wmf.2" (Fatal errors spike)
[20:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:30] OK so first of all, I need to figure out who worked on this. I think it might have been anomie.
[20:54:33] [12:54:25] (PS1) Legoktm: Only check isInGeneratorMode() on instances of ApiQueryGeneratorBase [extensions/ORES] - https://gerrit.wikimedia.org/r/320843
[20:54:53] halfak: looks like lego already got it
[20:55:01] wonderful
[20:55:05] <3 legoktm
[20:55:13] Thanks legoktm
[20:55:36] np :)
[20:56:17] 06Operations, 05Prometheus-metrics-monitoring: Deploy federation for Prometheus - https://phabricator.wikimedia.org/T150486#2787337 (10fgiunchedi)
[20:58:33] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2787369 (10Nuria) 05Open>03Resolved
[20:58:35] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2787370 (10Nuria)
[21:00:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:03:38] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787386 (10GWicke) >>! In T66214#2786655, @bearND wrote: > Another idea I brought up with GWicke on IRC yesterday is to embed the original dimensions i...
[21:03:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[21:04:45] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Log thumbnail requests that fail on Thumbor and not on Mediawiki and vice versa - https://phabricator.wikimedia.org/T147918#2787388 (10fgiunchedi) This is now merged and live, different status codes from mw and thumbor are reported in swift logs
[21:06:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:08:11] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.2/extensions/ORES/includes/ApiHooks.php: deploy I86e97b05b56b90d956616ef16e8aa86d96403b8c (duration: 00m 47s)
[21:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:18] grrrit-wm: nick
[21:09:33] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320851 (owner: 1020after4)
[21:09:36] Nick is already grrrit-wm not changing the nick.
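The fatal above ("Call to undefined method ApiQueryContributions::isInGeneratorMode()") happened because the ORES hook called a method that only exists on a subclass of the API module base class; legoktm's fix (r/320843, "Only check isInGeneratorMode() on instances of ApiQueryGeneratorBase") guards the call with an instance check. A minimal Python sketch of that guard pattern — the real fix is PHP in extensions/ORES, and the class/method names below only mirror the MediaWiki ones for illustration:

```python
# Illustrative stand-ins for the MediaWiki API module hierarchy;
# the actual fix is PHP (an `instanceof ApiQueryGeneratorBase` check).

class ApiQueryBase:
    """Base query module: does NOT define is_in_generator_mode()."""

class ApiQueryGeneratorBase(ApiQueryBase):
    """Subclass that actually defines the method."""
    def is_in_generator_mode(self) -> bool:
        return True

def needs_scores(module: ApiQueryBase) -> bool:
    # Check the type first: calling is_in_generator_mode() on a plain
    # ApiQueryBase is the Python equivalent of the fatal in the log.
    return isinstance(module, ApiQueryGeneratorBase) and module.is_in_generator_mode()
```

Short-circuiting on the type check means the subclass-only method is never reached for modules (like ApiQueryContributions) that don't inherit from the generator base.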
[21:09:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[21:09:53] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.2
[21:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:13:41] (03CR) 10Aaron Schulz: "https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/loadmonitor/LoadMonitorMySQL.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz)
[21:15:03] !log kafka2003 - signing puppet certs, salt-key, initial run
[21:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:27] Pchelolo: so, I have the binary built, but how do we deploy that now...
[21:22:59] Pchelolo: node-mapnik is actually in Debian, so we could in theory rebuild that, but it's an endless chain of npm dependencies, I stopped at like 4 or 5
[21:23:14] Pchelolo: I did build mapnik 3.0 in jessie though, that worked out fine
[21:23:28] paravoid: do you have build instructions and a list of packages? I can set up the docker script for kartotherian
[21:23:52] Pchelolo: and after that was done, with a little more effort I managed to convince node-pre-gyp to build
[21:23:55] then we'd deploy it like any other service with binary dependencies
[21:23:58] yeah, gimme a sec
[21:28:48] PROBLEM - puppet last run on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:28:58] PROBLEM - salt-minion processes on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:28:58] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:30:18] PROBLEM - DPKG on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:30:28] PROBLEM - Disk space on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:30:38] PROBLEM - MD RAID on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:31:18] PROBLEM - configured eth on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:31:28] PROBLEM - dhclient process on kafka2003 is CRITICAL: Return code of 255 is out of bounds
[21:32:28] RECOVERY - Disk space on kafka2003 is OK: DISK OK
[21:32:28] RECOVERY - dhclient process on kafka2003 is OK: PROCS OK: 0 processes with command name dhclient
[21:32:38] RECOVERY - MD RAID on kafka2003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[21:32:58] RECOVERY - salt-minion processes on kafka2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:33:18] RECOVERY - DPKG on kafka2003 is OK: All packages OK
[21:33:18] RECOVERY - configured eth on kafka2003 is OK: OK - interfaces up
[21:33:48] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[21:44:45] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787495 (10GWicke) >>! In T66214#2786657, @bearND wrote: >> look into which users rely on the current thumb format > > The apps and MCS rely on the cu...
[21:45:41] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2787496 (10Papaul)
[21:46:12] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2782678 (10Papaul) a:05Papaul>03Ottomata @Ottomata it is all yours now.
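The train steps logged above ("rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.2") work by rewriting the mapping from each wiki's dbname to a branch directory and shipping it to the app servers. A rough, hypothetical sketch of what that mapping update amounts to — the real files (wikiversions.json and the compiled wikiversions.php) are managed by scap, and the dict below is purely illustrative:

```python
# Hypothetical sketch of the "all wikis to <branch>" train step: every
# wiki's entry is repointed at the new php-<branch> directory. The data
# format mimics wikiversions.json's dbname -> directory mapping, but
# the names and values here are illustrative, not production data.

def roll_all_wikis(wikiversions, branch):
    """Point every wiki at php-<branch>, leaving the dbname keys alone."""
    return {wiki: "php-" + branch for wiki in wikiversions}

versions = {
    "testwiki": "php-1.29.0-wmf.2",  # already moved earlier in the week
    "enwiki": "php-1.29.0-wmf.1",
}
versions = roll_all_wikis(versions, "1.29.0-wmf.2")
```

Rolling back, as twentyafterfour did after the ORES fatal spike, is the same operation with the previous branch name, which is why the revert could ship in minutes.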
[21:47:40] 06Operations, 10ops-codfw: rack spare pool servers and update tracking sheet - https://phabricator.wikimedia.org/T150341#2787503 (10Papaul)
[21:48:30] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2787507 (10Ottomata) Yeehaw, thanks Papaul!
[21:51:22] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2787513 (10CRoslof) So, as I understand it, there is no problem with the current state of affairs that the requested change would fix. Also, I ha...
[21:53:03] Pchelolo: https://people.wikimedia.org/~faidon/mapnik/HOWTO
[21:53:12] Pchelolo: I've tried retracing my steps, I hope I'm not missing anything
[21:53:33] awesome, thank you paravoid I'll try to dockerize this in kartotherian
[21:53:38] let's see what happens
[21:54:10] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[21:57:57] are you dockerizing it for travis?
[21:58:05] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[21:58:43] we need to somehow deploy this in our prod in the end, right?
[21:58:49] which is not docker, at least not yet :)
[22:01:00] paravoid: we are using a jessie docker instance to build the deploy repo, I think that's what Pchelolo meant
[22:01:13] oh really
[22:02:10] (03PS2) 10Dzahn: remove neon.wikimedia.org, keep neon.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318444 (https://phabricator.wikimedia.org/T125023)
[22:03:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Cope with new libssl1.1 symbols introduced in 1.1.0c [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320814 (owner: 10Muehlenhoff)
[22:03:44] paravoid: https://wikitech.wikimedia.org/wiki/Services/Deployment#Regular_Deployment ; the build command is part of service-runner & uses docker under the hood to provide an environment that's consistent with production
[22:05:35] interesting
[22:07:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:07:46] (03CR) 10Dzahn: [C: 032] remove neon.wikimedia.org, keep neon.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318444 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn)
[22:09:44] 06Operations, 10Icinga, 10Shinken, 13Patch-For-Review: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#2787553 (10Dzahn) The only related patch left would be https://gerrit.wikimedia.org/r/#/c/318442/1/modules/monitoring/manifests/group.pp but t...
[22:10:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:11:47] 06Operations, 10ops-eqiad: decom neon (data center) - https://phabricator.wikimedia.org/T150490#2787556 (10Dzahn)
[22:12:29] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787573 (10Tgr) >>! In T66214#2786657, @bearND wrote: > Requirements for a long term solution should include, in addition to scaling, the ability to cr...
[22:12:34] (03PS1) 10Kaldari: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865
[22:13:00] 06Operations, 10ops-eqiad: decom neon (data center) - https://phabricator.wikimedia.org/T150490#2787578 (10Dzahn)
[22:13:03] jouncebot, next
[22:13:03] In 1 hour(s) and 46 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161111T0000)
[22:13:09] jouncebot, now
[22:13:09] No deployments scheduled for the next 1 hour(s) and 46 minute(s)
[22:13:15] jouncebot, refresh
[22:13:18] I refreshed my knowledge about deployments.
[22:13:20] jouncebot, reload
[22:13:24] jouncebot, now
[22:13:24] For the next 1 hour(s) and 59 minute(s): Deploy https://gerrit.wikimedia.org/r/#/c/320854/ (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161110T2213)
[22:14:00] 06Operations, 10ops-eqiad: decom neon (data center) - https://phabricator.wikimedia.org/T150490#2787556 (10Dzahn)
[22:14:36] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2787591 (10Dzahn)
[22:14:38] 06Operations, 10Icinga, 10Shinken, 13Patch-For-Review: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn) 05Open>03Resolved resolving, workflow continues in subtask for dc-ops T125023.
[22:15:18] 06Operations, 10Icinga: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#2787595 (10Dzahn)
[22:15:39] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2030058 (10Dzahn)
[22:15:56] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2030059 (10Dzahn) neon shutdown and removed from DNS, count: 7
[22:17:05] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[22:26:44] 06Operations, 06Performance-Team, 10Thumbor: Record OOM kills as a metric with mtail - https://phabricator.wikimedia.org/T148962#2787666 (10fgiunchedi) Some analysis from syslog on thumbor machine for oom-kills ``` thumbor1001# zgrep -h 'thumbor invoked oom-killer' syslog.7.gz syslog.6.gz syslog.5.gz syslog...
[22:27:09] jouncebot help
[22:27:09] **** JounceBot Help ****
[22:27:09] JounceBot is a deployment helper bot for the Wikimedia Foundation.
[22:27:09] You can find my source at https://github.com/mattofak/jouncebot
[22:27:09] Available commands:
[22:27:09] DIE Kill this bot
[22:27:09] HELP Prints the list of all commands known to the server
[22:27:10] NEXT Get the next deployment event(s if they happen at the same time)
[22:27:11] NOW Get the current deployment event(s) or the time until the next
[22:27:11] REFRESH Refresh my knowledge about deployments
[22:28:43] so if I wanted to sync messages for just one branch, the proper command would be `scap sync-l10n 10n 1.29.0-wmf.2`?
[22:29:13] err, scap sync-l10n 1.29.0-wmf.2
[22:29:32] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 12 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[swift-account-replicator],Service[swift-account-reaper],Service[swift-account-auditor],Service[swift-object]
[22:29:55] maybe, twentyafterfour or Reedy know? ^^^ :P
[22:30:02] PROBLEM - swift-account-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[22:30:12] PROBLEM - swift-account-reaper on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[22:30:22] PROBLEM - swift-account-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:30:37] Can't you just do it all, as the others would be a noop?
[22:30:40] I've not done a scap in ages
[22:31:02] RECOVERY - swift-account-auditor on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[22:31:08] kinda wanted it to be FASTER :D
[22:31:12] RECOVERY - swift-account-reaper on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[22:31:22] RECOVERY - swift-account-replicator on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:31:32] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:32:42] ms-be1027 is me, sorry about the noise
[22:35:39] !log maxsem@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 01m 01s)
[22:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:31] bleh, doesn't work
[22:37:37] okay, full scap
[22:38:07] !log maxsem@tin Started scap: https://gerrit.wikimedia.org/r/#/c/320864/
[22:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:21] MaxSem: I think so
[22:39:10] twentyafterfour, it didn't work :0
[22:39:36] didn't work as in failed with an error or just didn't seem to have the right effect?
[22:40:13] I don't think that code path has had a lot of testing
[22:41:38] didn't update the wikis
[22:42:33] I'll still assume that it was me doing smth wrong, but it'd be nice to have docs :}
[22:47:56] probably needed update-l10n before sync-l10n
[22:49:22] :O
[22:55:02] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:58:12] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:58:12] PROBLEM - HHVM rendering on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:58:55] (03PS1) 10Aaron Schulz: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124)
[23:00:50] !log maxsem@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/320864/ (duration: 22m 42s)
[23:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:04] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2787803 (10Deskana)
[23:03:21] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (10Deskana) I think this is in progress, so I've marked it as such.
[23:07:20] (03PS1) 10Madhuvishy: labstore: Add cluster ip assignment monitoring to drbd cluster [puppet] - 10https://gerrit.wikimedia.org/r/320930
[23:07:42] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:08:12] (03PS2) 10Madhuvishy: labstore: Add cluster ip assignment monitoring to drbd cluster [puppet] - 10https://gerrit.wikimedia.org/r/320930
[23:08:14] (03PS4) 10Krinkle: contint: Remove 'integration/phpcs' deployment source [puppet] - 10https://gerrit.wikimedia.org/r/301523
[23:08:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:09:08] (03CR) 10Krinkle: [C: 031] Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz)
[23:10:01] (03CR) 10Krinkle: "FYI: This isn't blocked on the core commit that adds the option since these maintenance scripts silently ignore unknown options." [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz)
[23:10:52] !log mw1185 - service hhvm restart
[23:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:02] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.023 second response time
[23:13:02] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 70088 bytes in 0.083 second response time
[23:15:29] (03CR) 10Faidon Liambotis: [C: 032] contint: Remove 'integration/phpcs' deployment source [puppet] - 10https://gerrit.wikimedia.org/r/301523 (owner: 10Krinkle)
[23:19:10] (03CR) 10Madhuvishy: [C: 032] labstore: Add cluster ip assignment monitoring to drbd cluster [puppet] - 10https://gerrit.wikimedia.org/r/320930 (owner: 10Madhuvishy)
[23:19:16] (03PS3) 10Madhuvishy: labstore: Add cluster ip assignment monitoring to drbd cluster [puppet] - 10https://gerrit.wikimedia.org/r/320930
[23:19:43] (03CR) 10Madhuvishy: [V: 032] labstore: Add cluster ip assignment monitoring to drbd cluster [puppet] - 10https://gerrit.wikimedia.org/r/320930 (owner: 10Madhuvishy)
[23:24:00] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[23:28:55] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787864 (10bearND) >>! In T66214#2787573, @Tgr wrote: > Is there any situation where that information could not be easily provided alongside in a more...
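The recurring "MediaWiki exceptions and fatals per minute on graphite1001" alerts throughout this log are graphite-backed checks that go CRITICAL when some percentage of recent datapoints sits above a threshold (e.g. "CRITICAL: 20.00% of data above the critical threshold [50.0]"). A hedged sketch of that evaluation logic; the production check's exact window, null handling, and trigger percentage are assumptions here, not the real configuration:

```python
# Hedged sketch of a check_graphite-style evaluation: fetch recent
# datapoints for a metric and alert on the fraction that exceed a
# critical value. Thresholds below are illustrative only.

def check_over_threshold(datapoints, critical=50.0, crit_pct=20.0):
    """Return 'CRITICAL' if at least crit_pct percent of non-null
    datapoints exceed `critical`, else 'OK'."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return "OK"  # no data; real checks may treat this as UNKNOWN
    over = 100.0 * sum(1 for v in values if v > critical) / len(values)
    return "CRITICAL" if over >= crit_pct else "OK"
```

Percentage-of-window checks like this smooth over single spiky datapoints, which is why the alert above flapped between PROBLEM and RECOVERY as the fatal rate rose and fell.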
[23:30:10] (03PS1) 10Madhuvishy: labstore: Grant nrpe permissions to read output of cluster ip monitor script [puppet] - 10https://gerrit.wikimedia.org/r/320934
[23:32:31] (03CR) 10Madhuvishy: [C: 032] labstore: Grant nrpe permissions to read output of cluster ip monitor script [puppet] - 10https://gerrit.wikimedia.org/r/320934 (owner: 10Madhuvishy)
[23:42:38] (03PS6) 10Dzahn: Add puppet-lint to Rakefile / Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/288620 (owner: 10Hashar)
[23:46:16] (03PS1) 10Madhuvishy: labstore: Add drbd service monitoring [puppet] - 10https://gerrit.wikimedia.org/r/320935 (https://phabricator.wikimedia.org/T144633)
[23:48:15] (03PS2) 10Madhuvishy: labstore: Add drbd service monitoring [puppet] - 10https://gerrit.wikimedia.org/r/320935 (https://phabricator.wikimedia.org/T144633)
[23:52:45] (03CR) 10Dzahn: [C: 032] Add puppet-lint to Rakefile / Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/288620 (owner: 10Hashar)
[23:54:56] 06Operations, 06Labs, 13Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2787910 (10madhuvishy)
[23:55:27] (03CR) 10Dzahn: "is this ready to go now? re: "depends on Id23483286ae2549bfd6f1377c6a0d0c0898b88c4 and Iaedf0d4903c0fd9a9cca3e648a2a9691f54c6af8 which wil" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy)
[23:58:11] (03CR) 10Dzahn: Stop using package=>latest for standard packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff)
[23:59:14] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80342 MB (15% inode=99%)