[02:35:26] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 14m 20s)
[02:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:18] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Sep 17 02:46:18 UTC 2018 (duration 10m 52s)
[02:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 846.46 seconds
[04:00:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 281.25 seconds
[04:16:39] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1198 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[05:16:56] (PS2) KartikMistry: hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962)
[05:28:00] !log Deploy schema change on s1 eqiad master (db1067) - T67448 T114117 T51191
[05:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:13] T114117: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117
[05:28:13] T51191: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191
[05:28:14] T67448: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448
[05:30:03] (CR) jerkins-bot: [V: -1] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:31:57] Operations, ops-eqiad, DBA: Degraded disk on db1069 (x1 master) - https://phabricator.wikimedia.org/T204462 (Marostegui)
[05:32:11] Operations, ops-eqiad, DBA: Degraded disk on db1069 (x1 master) - https://phabricator.wikimedia.org/T204462
(Marostegui) p:Triage>Normal
[05:32:30] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,7 instance=db1069:9100 job=node site=eqiad Marostegui T204462 - The acknowledgement expires at: 2018-09-19 05:32:17. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[05:36:18] !log Deploy schema change on s3:testwiki directly on the master (db2043) - T201011
[05:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:26] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[05:47:03] !log Deploy schema change on s3:testwikidatawiki directly on the master (db2043) - T201011
[05:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:11] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[05:50:58] (PS6) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968)
[05:50:59] (PS1) Giuseppe Lavagetto: mediawiki::web::vhost: add upload rewrites conditional [puppet] - https://gerrit.wikimedia.org/r/460777
[05:51:50] (CR) jerkins-bot: [V: -1] mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[05:55:53] (CR) Giuseppe Lavagetto: "PCC attests this is a noop on the current configuration:" [puppet] - https://gerrit.wikimedia.org/r/460777 (owner: Giuseppe Lavagetto)
[05:57:08] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::vhost: add upload rewrites conditional [puppet] - https://gerrit.wikimedia.org/r/460777 (owner: Giuseppe Lavagetto)
[06:01:53] (CR) Giuseppe Lavagetto: "Updated PCC here
https://puppet-compiler.wmflabs.org/compiler1002/12471/mw1261.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:03:59] !log Deploy schema change on s3 - T201011
[06:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:06] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[06:05:16] (PS3) KartikMistry: hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962)
[06:28:28] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/dhparam.pem]
[06:31:42] !log Deploy schema change on s4 - T201011
[06:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:50] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[06:32:28] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/varnishmtail-backend/varnishbackend.mtail]
[06:33:31] !log Deploy schema change on s7 (frwiktionary,metawiki) - T201011
[06:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:39] PROBLEM - MariaDB Slave SQL: s7 on db1094 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP trr_user_page_revision: check that column/key exists on query. Default database: metawiki.
[Query snipped]
[06:37:37] ^ checking that
[06:41:04] (PS7) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968)
[06:42:08] RECOVERY - MariaDB Slave SQL: s7 on db1094 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:47:43] (PS1) Marostegui: db-eqiad.php: Depool db1094 and db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/460786 (https://phabricator.wikimedia.org/T201011)
[06:50:17] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 and db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/460786 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[06:51:57] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 and db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/460786 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[06:53:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 and db1098:3317 - T201011 (duration: 01m 04s)
[06:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:48] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[06:53:51] !log Stop replication in sync on db1094 and db1098:3317 - T201011
[06:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:43] (PS1) Muehlenhoff: Add component/node10 for stretch-wikimedia [puppet] - https://gerrit.wikimedia.org/r/460787 (https://phabricator.wikimedia.org/T203239)
[06:56:30] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094 and db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/460788
[06:57:14] (CR) Muehlenhoff: [C: 2] Add component/node10 for stretch-wikimedia [puppet] - https://gerrit.wikimedia.org/r/460787 (https://phabricator.wikimedia.org/T203239) (owner: Muehlenhoff)
[06:57:49] RECOVERY - puppet last run
on cp1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:29] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094 and db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/460788 (owner: Marostegui)
[06:58:49] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094 and db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/460788 (owner: Marostegui)
[07:01:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 and db1098:3317 - T201011 (duration: 00m 49s)
[07:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:07] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[07:04:37] (CR) jenkins-bot: db-eqiad.php: Depool db1094 and db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/460786 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[07:04:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094 and db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/460788 (owner: Marostegui)
[07:06:33] (PS3) Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[07:15:03] (PS1) Marostegui: db-eqiad.php: Depool db1092, db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/460792 (https://phabricator.wikimedia.org/T201011)
[07:15:37] Operations, Discovery-Search, Elasticsearch: Modify elasticsearch_shard_size_check plugin to display only indices and shard size - https://phabricator.wikimedia.org/T204363 (Mathew.onipe)
[07:15:49] Operations, Discovery-Search, Elasticsearch: Resolve elasticsearch shard size alert - https://phabricator.wikimedia.org/T204362
(Mathew.onipe)
[07:16:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092, db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/460792 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[07:17:34] (Merged) jenkins-bot: db-eqiad.php: Depool db1092, db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/460792 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[07:18:40] (CR) jenkins-bot: db-eqiad.php: Depool db1092, db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/460792 (https://phabricator.wikimedia.org/T201011) (owner: Marostegui)
[07:18:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 and db1099:3318 - T201011 (duration: 00m 49s)
[07:18:46] !log Stop replication in sync on db1092 and db1099:3318 - T201011
[07:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:51] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[07:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1092, db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/460797
[07:22:38] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1092, db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/460797 (owner: Marostegui)
[07:23:45] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1092, db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/460797 (owner: Marostegui)
[07:24:14] Operations, Cloud-Services, Wikibase-Quality, Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (Addshore) >>! In T204267#4584397, @Smalyshev wrote: > All bans are temporary, so as soon as traffic returns to normal the bans wi...
[07:24:45] !log Deploy schema change on s8 - T201011
[07:24:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 and db1099:3318 - T201011 (duration: 00m 49s)
[07:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:53] T201011: Apply schema change to translate_reviews in WMF - https://phabricator.wikimedia.org/T201011
[07:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:06] Operations, MediaWiki-extensions-Translate, Language-2018-July-September, MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (Maros...
[07:29:12] !log force power on for ms-be2030
[07:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:54] Operations, MediaWiki-extensions-Translate, Language-2018-July-September, MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (Maros...
[07:31:20] !log repair sdj on ms-be2040 - T199198
[07:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:28] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[07:33:03] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1092, db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/460797 (owner: Marostegui)
[07:35:38] RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms
[07:40:00] zeljkof, pinging you about T204243 :)
[07:40:01] T204243: Throttle exemption for event in Ireland - https://phabricator.wikimedia.org/T204243
[07:40:29] Urbanecm: ok, on it in a minute
[07:40:36] thank you
[07:45:51] (PS2) Muehlenhoff: Revert "sre.switchdc.mediawiki: parsoid skip broken host" [cookbooks] - https://gerrit.wikimedia.org/r/460049 (owner: Dzahn)
[07:45:57] (PS4) Giuseppe Lavagetto: mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] - https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968)
[07:46:18] !log repooled wtp1043 (T196886)
[07:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:26] T196886: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886
[07:50:14] <_joe_> I'm going to merge ^^, I removed a duplicate line with respect to the version Luca gave +1 to
[07:50:31] Urbanecm: something urgent came up, I'll be back in 10-20 minutes or so
[07:50:45] (CR) Muehlenhoff: [C: 2] Revert "sre.switchdc.mediawiki: parsoid skip broken host" [cookbooks] - https://gerrit.wikimedia.org/r/460049 (owner: Dzahn)
[07:50:56] (CR) Muehlenhoff: [V: 2 C: 2] Revert "sre.switchdc.mediawiki: parsoid skip broken host" [cookbooks] - https://gerrit.wikimedia.org/r/460049 (owner: Dzahn)
[07:50:58] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] -
https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[07:51:50] <_joe_> meh, delaying submitting that
[07:51:55] <_joe_> gtg
[07:59:59] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[08:04:11] Operations, Wikimedia-Logstash, Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (fgiunchedi) I've experimented with `librdkafka` settings and setting `queue.buffering.max.ms=50` plus `batch.num.messages=1000` I've been able to reac...
[08:07:10] !log stop and reimage db1062 (may generate s7 lag)
[08:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:05] Urbanecm: ok, I'll deploy the throttle rule for T204243 now
[08:20:06] T204243: Throttle exemption for event in Ireland - https://phabricator.wikimedia.org/T204243
[08:25:32] (CR) Zfilipin: [C: 2] "The event takes place before EU SWAT, so deploying now." [mediawiki-config] - https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) (owner: Urbanecm)
[08:27:19] (Merged) jenkins-bot: New throttle rule for enwiki event [mediawiki-config] - https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) (owner: Urbanecm)
[08:29:24] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:460432|New throttle rule for enwiki event (T204243)]] (duration: 00m 49s)
[08:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:32] T204243: Throttle exemption for event in Ireland - https://phabricator.wikimedia.org/T204243
[08:30:10] Urbanecm: deployed!
thanks for the reminder
[08:30:44] (PS4) Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968)
[08:30:46] (PS3) Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968)
[08:30:48] (PS3) Giuseppe Lavagetto: mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - https://gerrit.wikimedia.org/r/452325
[08:30:50] (PS1) Jcrespo: mariadb: Reimage db1068 to stretch [puppet] - https://gerrit.wikimedia.org/r/460857 (https://phabricator.wikimedia.org/T204311)
[08:30:52] (PS1) Jcrespo: mariadb: Disable notifications on db1068 [puppet] - https://gerrit.wikimedia.org/r/460858 (https://phabricator.wikimedia.org/T204311)
[08:32:54] (CR) jenkins-bot: New throttle rule for enwiki event [mediawiki-config] - https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) (owner: Urbanecm)
[08:43:54] !log installing postgresql security updates on maps*
[08:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:58] are we still in time for a cherry-pick for congresslookup?
[08:54:12] (CR) Jcrespo: [C: 2] mariadb: Reimage db1068 to stretch [puppet] - https://gerrit.wikimedia.org/r/460857 (https://phabricator.wikimedia.org/T204311) (owner: Jcrespo)
[08:54:47] (CR) Jcrespo: [C: 2] mariadb: Disable notifications on db1068 [puppet] - https://gerrit.wikimedia.org/r/460858 (https://phabricator.wikimedia.org/T204311) (owner: Jcrespo)
[08:56:03] Operations, Citoid, Patch-For-Review, Services (watching), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (Sebastian_Berlin-WMSE) I noticed that a [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/tr...
[08:59:51] Operations, ops-eqiad: Heating alerts on kafka1014 - https://phabricator.wikimedia.org/T204479 (MoritzMuehlenhoff)
[09:07:01] (PS3) Jcrespo: mariadb: Reenable db1062 notifications after reimage [puppet] - https://gerrit.wikimedia.org/r/460480
[09:09:55] (CR) Jcrespo: [C: 2] mariadb: Reenable db1062 notifications after reimage [puppet] - https://gerrit.wikimedia.org/r/460480 (owner: Jcrespo)
[09:14:12] (PS1) Muehlenhoff: Also list component/node10 in udeb-components [puppet] - https://gerrit.wikimedia.org/r/460863
[09:16:23] (PS1) MarcoAurelio: Enable WikidataPageBanner for glwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713)
[09:16:31] !log stop and reimage db1068 (may generate s4 lag)
[09:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:47] (PS1) Jcrespo: Revert "mariadb: Disable notifications on db1068" [puppet] - https://gerrit.wikimedia.org/r/460865
[09:19:06] (CR) Muehlenhoff: [C: 2] Also list component/node10 in udeb-components [puppet] - https://gerrit.wikimedia.org/r/460863 (owner: Muehlenhoff)
[09:22:35] Operations, DBA, Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 -
https://phabricator.wikimedia.org/T204311 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1068.eqiad.wmnet'] ``` The log can be found in `/...
[09:28:06] !log Deploy schema change on s7 eqiad master (db1062) - T187089
[09:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:14] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[09:30:38] (CR) ArielGlenn: "> maybe we need to make a difference between a role for both types of" [puppet] - https://gerrit.wikimedia.org/r/460064 (owner: Dzahn)
[09:33:27] (PS2) Ema: Remove cache_misc definitions [puppet] - https://gerrit.wikimedia.org/r/460219 (https://phabricator.wikimedia.org/T164609)
[09:34:33] (CR) Ema: [C: 2] Remove cache_misc definitions [puppet] - https://gerrit.wikimedia.org/r/460219 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[09:38:09] (PS1) Banyek: mariadb: disable notifications for db1114 [puppet] - https://gerrit.wikimedia.org/r/460867
[09:39:11] (CR) Marostegui: "Add the task number to the commit message so it gets tracked on the task" [puppet] - https://gerrit.wikimedia.org/r/460867 (owner: Banyek)
[09:40:38] Operations, Electron-PDFs, OfflineContentGenerator, Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815 (phuedx)
[09:42:15] (PS2) Banyek: mariadb: disable notifications for db1114 [puppet] - https://gerrit.wikimedia.org/r/460867 (https://phabricator.wikimedia.org/T203565)
[09:42:31] (CR) Marostegui: [C: 1] mariadb: disable notifications for db1114 [puppet] - https://gerrit.wikimedia.org/r/460867 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[09:43:43] (CR) Banyek: [C: 2] mariadb: disable notifications for db1114 [puppet] -
https://gerrit.wikimedia.org/r/460867 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[09:44:10] I now deploy the change for silent the host
[09:44:18] nothere
[09:45:02] (PS1) Gilles: Disable thumbnail prerendering on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/460868 (https://phabricator.wikimedia.org/T204478)
[09:46:10] (PS3) Banyek: mariadb: disable notifications for db1114 [puppet] - https://gerrit.wikimedia.org/r/460867 (https://phabricator.wikimedia.org/T203565)
[09:46:16] (CR) Banyek: [V: 2 C: 2] mariadb: disable notifications for db1114 [puppet] - https://gerrit.wikimedia.org/r/460867 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[09:47:41] (CR) ArielGlenn: "> i think i'd prefer not using the arrow syntax to declare" [puppet] - https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: ArielGlenn)
[09:50:25] !log Deploy schema change on db1062 - T203548
[09:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:33] T203548: Remove partitions from s7 masters (db1062 and db2040) for metawiki.pagelinks - https://phabricator.wikimedia.org/T203548
[09:53:55] Operations, Patch-For-Review, Readers-Web-Backlog (Tracking), Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (MoritzMuehlenhoff) nodejs 10 packages for stretch-wikimedia are now available in the repository component "component/node10...
[09:57:26] (PS1) Banyek: maridadb: depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565)
[10:00:16] (CR) Marostegui: [C: -1] maridadb: depool db1114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[10:00:24] :(
[10:01:59] (PS2) Banyek: maridadb: depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565)
[10:03:13] (CR) jerkins-bot: [V: -1] maridadb: depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[10:07:52] (PS3) Banyek: maridadb: depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565)
[10:10:29] Operations, DBA, Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1068.eqiad.wmnet'] ``` and were **ALL** successful.
[10:11:43] Operations, Phabricator, Traffic: Allow traffic team to manage the traffic blog on phame - https://phabricator.wikimedia.org/T204355 (ema) >>! In T204355#4584223, @Dzahn wrote: > I hit edit on https://phabricator.wikimedia.org/phame/blog/edit/11/ I have the feeling this will end like those calls t...
[10:19:22] Operations, DBA, Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (Marostegui)
[10:22:44] (CR) Muehlenhoff: [C: 1] "Looks good to me!"
[puppet] - https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[10:22:51] (PS2) Ema: Remove maps-lb definitions [dns] - https://gerrit.wikimedia.org/r/460274 (https://phabricator.wikimedia.org/T164608)
[10:23:24] (CR) Ema: [C: 2] Remove maps-lb definitions [dns] - https://gerrit.wikimedia.org/r/460274 (https://phabricator.wikimedia.org/T164608) (owner: Ema)
[10:27:56] !log START addshore@mwmaint2001:~$ mwscript refreshLinks.php --wiki wikidatawiki --namespace 146 --e 56031711 54387042
[10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:14] !log START addshore@mwmaint2001:~$ mwscript refreshLinks.php --wiki wikidatawiki --namespace 146 --e 56031711 54387042 # T195302 T195302
[10:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:23] T195302: Language, Lexical Category, and Grammatical features of a lexeme do not show up in Special:WhatLinksHere - https://phabricator.wikimedia.org/T195302
[10:28:24] !log START addshore@mwmaint2001:~$ mwscript refreshLinks.php --wiki wikidatawiki --namespace 146 --e 56031711 54387042 # T195301 T195302
[10:28:29] what a messy log entry....
[10:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:33] T195301: Search on test.wikidata.org - https://phabricator.wikimedia.org/T195301
[10:28:57] (CR) Marostegui: [C: -1] maridadb: depool db1114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[10:30:04] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180917T1030).
[10:31:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:31:30] !log stopped my maint script
[10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:36] * addshore looks at the dashboard
[10:31:42] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/460873 (https://phabricator.wikimedia.org/T128546)
[10:33:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:34:22] jouncebot: next
[10:34:22] In 0 hour(s) and 25 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180917T1100)
[10:34:25] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/460873 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:35:42] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/460873 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:35:55] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/460873 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:37:05] !log starting my maint script again
[10:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:02] (CR) Muehlenhoff: [C: 1] "Looks good to me!"
[puppet] - https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[10:40:02] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:460873|Bumping portals to master (T128546)]] (duration: 00m 50s)
[10:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:10] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:40:52] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:460873|Bumping portals to master (T128546)]] (duration: 00m 50s)
[10:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:36] (PS4) Banyek: maridadb: depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565)
[10:44:12] (CR) Mathew.onipe: "https://puppet-compiler.wmflabs.org/compiler1002/12476/elastic1019.eqiad.wmnet/change.elastic1019.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: Mathew.onipe)
[10:49:31] (CR) Muehlenhoff: [C: 1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/452325 (owner: Giuseppe Lavagetto)
[10:50:42] hashar: did you see how CI is broken for keyholder?
[11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180917T1100).
[11:00:05] revi, Hauskatze, and gilles: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:06] o/
[11:00:30] oi, need 5 mins
[11:00:35] o/
[11:00:38] I can swat today
[11:01:21] (CR) Marostegui: [C: 1] "This is ok, but we need to wait for the SWAT to be done before we can depool" [mediawiki-config] - https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: Banyek)
[11:01:21] o/
[11:01:39] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/460321 (https://phabricator.wikimedia.org/T177385) (owner: Muehlenhoff)
[11:01:51] do other's patch first :)
[11:01:55] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/460322 (owner: Muehlenhoff)
[11:02:32] (PS4) Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[11:02:34] (CR) Jcrespo: [C: 2] Revert "mariadb: Disable notifications on db1068" [puppet] - https://gerrit.wikimedia.org/r/460865 (owner: Jcrespo)
[11:02:41] (PS2) Jcrespo: Revert "mariadb: Disable notifications on db1068" [puppet] - https://gerrit.wikimedia.org/r/460865
[11:04:54] (CR) Mobrovac: [C: 1] Replace the semver patch version in Accept with 0 [puppet] - https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: Ppchelko)
[11:05:35] zeljkof: you can start with me then given that revi needs 5', unless he's returned
[11:05:45] now ready :o
[11:06:00] ok, go ahead
[11:06:30] (PS2) Revi: Enable FileExport on Korean Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399)
[11:06:34] rebasing
[11:06:45] gilles: you're a deployer, right? want to deploy your patch yourself?
[11:06:52] 10Operations, 10ops-eqiad: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 (10MoritzMuehlenhoff) [11:07:19] revi: I'll review your patch in a minute, getting ready for swat [11:07:23] zeljkof: sure, let me know when [11:07:23] kk [11:07:27] (opening all the tabs/windows) [11:07:37] gilles: do you need a long time to test? if not, you can go first [11:08:04] no, it's quick to verify [11:08:18] gilles: in that case, go ahead, let me know when you are done [11:08:56] (03CR) 10Gilles: [C: 032] Disable thumbnail prerendering on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460868 (https://phabricator.wikimedia.org/T204478) (owner: 10Gilles) [11:09:45] revi: please stand by, I'm reviewing your patch, you are next, after gilles [11:09:48] kk [11:10:29] (03PS5) 10Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) [11:10:52] (03Merged) 10jenkins-bot: Disable thumbnail prerendering on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460868 (https://phabricator.wikimedia.org/T204478) (owner: 10Gilles) [11:11:28] (03PS3) 10Revi: Enable FileExport on Korean Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399) [11:14:44] (03CR) 10Zfilipin: [C: 031] Enable FileExport on Korean Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399) (owner: 10Revi) [11:16:03] (03CR) 10Zfilipin: [C: 031] Enable WikidataPageBanner for glwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713) (owner: 10MarcoAurelio) [11:17:02] !log installing curl security updates on trusty (Debian already fixed) [11:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:30] !log gilles@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: T204478 Disable thumbnail prerendering on private wikis (duration: 00m 49s) [11:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:37] T204478: ThumbnailRender job fails on private wikis - https://phabricator.wikimedia.org/T204478 [11:18:56] (03CR) 10jenkins-bot: Disable thumbnail prerendering on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460868 (https://phabricator.wikimedia.org/T204478) (owner: 10Gilles) [11:19:07] Hauskatze: a question about CongressLookup commits [11:19:16] shoot [11:19:16] zeljkof: I'm done [11:19:31] gilles: great! I'll take over swat [11:19:34] * revi is ready [11:19:54] revi: I'll ping you in a few minutes when your patch is at mwdebug1002 for testing [11:19:55] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) [11:19:57] revi is revi [11:19:59] :P [11:20:01] lol [11:20:21] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399) (owner: 10Revi) [11:21:09] Hauskatze: sorry, got distracted, so patches for CongressLookup, are they really necessary? [11:21:28] (03Merged) 10jenkins-bot: Enable FileExport on Korean Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399) (owner: 10Revi) [11:21:50] I mean, can't they wait the train? 
[11:21:59] (03PS1) 10Jcrespo: mariadb: Reimage db1061 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460884 (https://phabricator.wikimedia.org/T204311) [11:22:01] (03PS1) 10Jcrespo: mariadb: Disable notifications on db1061 [puppet] - 10https://gerrit.wikimedia.org/r/460885 (https://phabricator.wikimedia.org/T204311) [11:22:04] (sorry, got disconnected for a second) [11:22:17] (so maybe I've missed an answer) [11:22:25] btw I thought we were on codfw not eqiad [11:22:42] zeljkof: I'd say yes, given that Special:NetNeutrality is still up and we're potentially targetting people to contact a dead Senator [11:22:53] the train will take a couple more of days [11:23:06] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db1061 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460884 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [11:23:15] Hauskatze: ok, makes sense, I didn't know if it's urgent, hence the questiohn [11:23:17] question [11:23:31] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications on db1061 [puppet] - 10https://gerrit.wikimedia.org/r/460885 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [11:24:07] Hauskatze: so, both patches can be deployed at the same time, right? [11:24:22] revi: the patch is at mwdebug1002, please test and let me know if I can deploy it [11:24:34] yes, zeljkof; I don't think they'd conflict each other [11:24:35] kk [11:24:43] (03PS2) 10Zfilipin: Enable WikidataPageBanner for glwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713) (owner: 10MarcoAurelio) [11:24:55] I can test both at meta on mwdebug(tellmewhatnumber) once they're there [11:25:49] Hauskatze: do you prefer to merge/test them one by one? 
patches look pretty simple, that's why I've proposed to merge them [11:26:11] zeljkof: FYI, due to ongoing maintenance, while we try to impact minimally production, we have to break eqiad databases from time to time [11:26:13] Looks good, zeljkof [11:26:19] zeljkof: should be good to merge both [11:26:34] revi: ok, deploying [11:26:41] I would suggest, at least on this first week, to use as debug codfw mw hosts [11:26:44] Hauskatze: ok, will merge them now, not sure how long it will take [11:27:03] nothing bad can happen, but you could get false positive errors on eqiad [11:27:06] patientia in regulis nostrum prima virtus est [11:27:38] zeljkof: do you prefer me to send an email to ops- ? [11:27:43] jynus: thanks for letting me know, so which hosts should I use? [11:27:49] * Hauskatze wonders why we're still using deploy1001 if we're on codfw now [11:27:51] jynus: yes, please do [11:28:05] jynus: this is all I know about debug/canary hosts https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [11:28:08] mwdebug2001 or 2002, wich I am not sure if real hosts or just dnames [11:28:26] zeljkof: there is 2 canaries on codfw, too [11:29:02] jynus: can you write it in e-mail and/or update swat docs? https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [11:29:09] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460622|Enable FileExport on Korean Wikipedia (T204399)]] (duration: 00m 50s) [11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:17] T204399: enable FileExport on Korean Wikipedia - https://phabricator.wikimedia.org/T204399 [11:29:22] jynus: I'm not the only deployer, so others get the message [11:29:40] revi: the patch is deployed, please test and thanks for deploying with #releng ;) [11:29:58] yes,I will, I am trying to find the right host numbers [11:29:59] thanks [11:30:48] jynus: thanks! 
[11:31:05] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Available_backends [11:31:30] and works on prod [11:31:36] ^I will now on an email suggest to use mwdebug2001 and 2 (aka 2017 snd 2099) [11:31:40] jynus: thanks! I knew I've seen the list somewhere [11:31:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:32:14] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713) (owner: 10MarcoAurelio) [11:32:56] Hauskatze: is there a phab task for CongressLookup commits? [11:33:04] zeljkof: no [11:33:17] just regular cleanup [11:33:26] or I think I did opened one [11:33:28] let me check [11:33:31] (03Merged) 10jenkins-bot: Enable WikidataPageBanner for glwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713) (owner: 10MarcoAurelio) [11:33:35] Hauskatze: ok, just checking, I'm not familiar with the extension [11:33:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:34:32] (03CR) 10jenkins-bot: Enable FileExport on Korean Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399) (owner: 10Revi) [11:34:34] (03CR) 10jenkins-bot: Enable WikidataPageBanner for glwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460864 (https://phabricator.wikimedia.org/T199713) (owner: 10MarcoAurelio) [11:34:34] zeljkof: T203611 [11:34:35] T203611: [CongressLookup] Update senators.json - https://phabricator.wikimedia.org/T203611 [11:34:45] Hauskatze: 460864 is at mwdebug1002 for testing [11:34:53] * Hauskatze checks [11:35:20] 
Hauskatze: can you please update commit messages, adding the task? [11:36:01] zeljkof: sure, and fwiw WikidataPageBanner appears on Special:Version for glwiki [11:36:17] Hauskatze: so 460864 is good to deploy? [11:37:09] zeljkof: yep [11:37:14] and done with the cmtmssgs [11:37:18] Hauskatze: ok, deploying [11:37:43] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460864|Enable WikidataPageBanner for glwiki (T199713)]] (duration: 00m 50s) [11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:51] T199713: Add WikidataPageBanner extension to gl.wikipedia - https://phabricator.wikimedia.org/T199713 [11:37:51] Hauskatze: 460864 deployed, please check [11:38:35] done [11:38:38] works [11:41:07] Hauskatze: great, I've +2d both CongressLookup patches, they probably need 5-10 minutes to merge :/ [11:41:40] if we trust the zuul page, 20 mins each :/ :/ [11:41:47] argh, one job already failed :( [11:41:50] zeljkof: sent email to ops- with proper explanation [11:42:01] jynus: thanks! [11:42:56] nothing really changes, just stringly prefer codfw hosts [11:43:46] zeljkof: I can't understand that CI output, it's just a bunch of ~~crap~~ random stuff [11:44:09] Hauskatze: I think there was a problem with installing an npm package :/ will probably just work the next time [11:44:45] it has all the signs [11:45:07] if it was really a json failure npm test before merging would've caught it [11:45:18] yes [11:45:35] it's unlikely anything is broken, except a problem with the jenkins job [11:45:51] okay they finished [11:47:37] zeljkof: can we try to re+2 the failing one? 
[11:47:44] both passed zuul already [11:47:59] !log installing ghostscript security updates on stretch [11:48:01] Hauskatze: already done :D [11:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:29] it's passing now [11:50:36] (03PS2) 10Ema: Remove misc-web [dns] - 10https://gerrit.wikimedia.org/r/460275 (https://phabricator.wikimedia.org/T164609) [11:52:07] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10jcrespo) p:05Triage>03High [11:53:26] (03CR) 10Ema: [C: 032] Remove misc-web [dns] - 10https://gerrit.wikimedia.org/r/460275 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [11:53:38] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) [11:53:46] (03PS6) 10Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) [11:55:05] zeljkof: passed :D [11:55:19] Hauskatze: great, will pull it on a debug server [11:56:04] (03PS1) 10Jcrespo: mariadb: Disable notifications for db1071 [puppet] - 10https://gerrit.wikimedia.org/r/460888 (https://phabricator.wikimedia.org/T204311) [11:58:27] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) 05Open>03Resolved [11:58:33] (03PS1) 10Jcrespo: mariadb: Reimage db1071 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460889 (https://phabricator.wikimedia.org/T204311) [11:58:45] Hauskatze: CongressLookup patches are at mwdebug1002 [11:59:12] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db1071 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460889 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [11:59:32] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications for db1071 [puppet] - 
10https://gerrit.wikimedia.org/r/460888 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [11:59:52] zeljkof: checking [12:00:19] zeljkof: works [12:00:30] Hauskatze: deploying [12:00:35] got to go [12:01:25] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/CongressLookup/: SWAT: [[gerrit:460860|Senator John McCain deceased (T203611)]] [[gerrit:460861|Add Sen. Jon Kyl (T203611)]] (duration: 00m 51s) [12:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:34] T203611: [CongressLookup] Update senators.json - https://phabricator.wikimedia.org/T203611 [12:02:05] !log EU SWAT finished [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:42] (03PS1) 10Jcrespo: mariadb: Disallow reimage to all hosts except the test db [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) [12:05:22] (03CR) 10Jcrespo: [C: 04-2] "Blocked on reimage finish." [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [12:08:47] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: 10Mathew.onipe) [12:08:52] !log stop and reimage db1071 (may generate s8 lag) [12:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:44] (03CR) 10Mathew.onipe: "https://puppet-compiler.wmflabs.org/compiler1002/12479/elastic1019.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: 10Mathew.onipe) [12:13:22] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1071.eqiad.wmnet'] ``` The log can be found in `/... 
[12:16:24] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Krenair) [12:26:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:29:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:31:14] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.3706 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:53:33] (03CR) 10Banyek: [C: 032] maridadb: depool db1114 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [12:53:46] !log depooling db1114 [12:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:42] (03Merged) 10jenkins-bot: maridadb: depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [12:56:14] (03PS4) 10Daniel Kinzler: Set MCR migration to write-both/read-new on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) [12:57:15] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3836 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:58:29] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T203565: depooling db1114 (duration: 00m 49s) [12:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:36] T203565: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 [12:59:35] PROBLEM - High lag on wdqs2003 is CRITICAL: 3626 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:01:14] (03PS5) 10Daniel Kinzler: Set MCR migration to write-both/read-new on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) [13:07:11] (03CR) 10jenkins-bot: maridadb: depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460869 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:12:48] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) The old instances have been powered off for a few days now and afaict there have been no problems. I think it's safe to remove them. [13:14:32] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1071.eqiad.wmnet'] ``` and were **ALL** successful. 
[13:15:19] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10jcrespo) [13:15:36] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.008593 https://grafana.wikimedia.org/dashboard/db/logstash [13:15:38] jynus: indeed that reimage generated a whole lot of errors from snapshot1008 about switching to read only mode heh [13:16:13] (03PS1) 10Banyek: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/460897 [13:16:15] (03PS1) 10Banyek: mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) [13:16:50] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/460897 (owner: 10Banyek) [13:19:09] (03CR) 10Marostegui: [C: 031] mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:19:55] (03CR) 10Jcrespo: "I am not sure that syntax is valid bash- bash != proper perl-compatible regular expressions. I would stick to "db1114|db1118)". 
Not I don'" [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:20:02] (03CR) 10Jcrespo: [C: 04-1] mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:22:15] (03CR) 10Marostegui: [C: 04-1] "See above" [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:23:17] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.00354 https://grafana.wikimedia.org/dashboard/db/logstash [13:28:27] (03PS1) 10Jcrespo: Revert "mariadb: Disable notifications for db1071" [puppet] - 10https://gerrit.wikimedia.org/r/460901 [13:28:42] (03PS1) 10Jcrespo: Revert "mariadb: Disable notifications on db1061" [puppet] - 10https://gerrit.wikimedia.org/r/460902 [13:28:57] (03PS2) 10Banyek: mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) [13:29:19] (03CR) 10Jcrespo: "Blocked on T204493 and reimage finally happening." 
[puppet] - 10https://gerrit.wikimedia.org/r/460902 (owner: 10Jcrespo) [13:29:23] (03CR) 10Marostegui: [C: 031] mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:30:06] !log removed intel-microcode 3.20180703 from apt.wikimedia.org/stretch-wikimedia (superseded by new release shipped via security.debian.org) [13:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:38] (03CR) 10Jcrespo: [C: 031] mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:31:09] (03CR) 10Banyek: [C: 032] mariadb: Wipe srv partition on db1114 [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:31:43] (03CR) 10Jcrespo: [C: 031] "Sorry for these, but one loses later a lot of time debugging these issues later - I know by experience- (the error we get on network confi" [puppet] - 10https://gerrit.wikimedia.org/r/460898 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:33:06] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1752 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:33:47] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2611 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:34:15] 10Operations, 10ops-eqiad: Heating alerts on kafka1014 - https://phabricator.wikimedia.org/T204479 (10elukey) We can stop the host and verify the status of the thermal paste if it is worth it :) [13:34:26] looking into logstash alerts [13:38:06] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10jcrespo) [13:39:40] !log removed intel-microcode 3.20180703 from apt.wikimedia.org/jessie-wikimedia (superseded by new release shipped via security.debian.org) 
[13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:25] !log repair sdl on ms-be2041 - T199198 [13:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:33] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [13:40:37] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.02522 https://grafana.wikimedia.org/dashboard/db/logstash [13:41:05] 10Operations, 10DBA, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10bmansurov) >>! In T203039#4574768, @Pchelolo wrote: > Another consideration is that here we're distributing pre-learned AI model, I believe there should be industry... [13:41:24] (03PS2) 10Muehlenhoff: Switch role for cumin2001 to role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/460321 (https://phabricator.wikimedia.org/T177385) [13:42:28] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.04441 https://grafana.wikimedia.org/dashboard/db/logstash [13:42:43] (03PS8) 10Ema: Replace the semver patch version in Accept with 0 [puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [13:42:52] (03CR) 10Muehlenhoff: [C: 032] Switch role for cumin2001 to role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/460321 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [13:43:32] looks like ~75% of log events in the past 4 hour window are from snaptshot1008 throwing "Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode” errors. 
seems to be clearing already [13:43:48] (03PS9) 10Ema: Replace the semver patch version in Accept with 0 [puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [13:44:10] why was it read-only? [13:44:27] (03CR) 10Ema: [C: 032] Replace the semver patch version in Accept with 0 [puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [13:47:45] the error mentions all replicas lagged but not sure why [13:47:56] (03PS1) 10Muehlenhoff: Whitelist cumin2001 for profile::mariadb::wmf_root_client [puppet] - 10https://gerrit.wikimedia.org/r/460904 [13:48:17] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1651 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:49:06] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/460904 (owner: 10Muehlenhoff) [13:49:10] logstash input rate and packet loss are already trending downward so this should clear soon [13:49:17] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [13:50:35] (03CR) 10Muehlenhoff: [C: 032] Whitelist cumin2001 for profile::mariadb::wmf_root_client [puppet] - 10https://gerrit.wikimedia.org/r/460904 (owner: 10Muehlenhoff) [13:56:26] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [13:57:26] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:03:52] (03PS1) 10Banyek: MariaDB: mute notifications of db1119 [puppet] - 10https://gerrit.wikimedia.org/r/460906 (https://phabricator.wikimedia.org/T203565) [14:06:14] 
(03CR) 10Marostegui: [C: 031] MariaDB: mute notifications of db1119 [puppet] - 10https://gerrit.wikimedia.org/r/460906 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:06:16] (03PS1) 10Banyek: Labsdb: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) [14:07:20] (03PS1) 10Volans: cumin: fix puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/460908 (https://phabricator.wikimedia.org/T177385) [14:07:48] (03CR) 10Jcrespo: Labsdb: depool db1119 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:08:00] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [14:08:48] (03CR) 10Marostegui: "This is not labsdb." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:08:58] 10Operations, 10Cloud-VPS (Project-requests): Request removal of puppet3-diffs VPS project - https://phabricator.wikimedia.org/T204532 (10herron) [14:09:21] 10Operations, 10Cloud-VPS (Project-requests): Request removal of puppet3-diffs VPS project - https://phabricator.wikimedia.org/T204532 (10herron) [14:09:24] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) [14:09:35] (03PS2) 10Banyek: db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) [14:10:10] (03CR) 10Marostegui: [C: 031] db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:10:53] (03PS2) 10Jcrespo: Revert "mariadb: Disable notifications for db1071" [puppet] - 10https://gerrit.wikimedia.org/r/460901 
[14:12:32] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:13:32] (03CR) 10Jcrespo: [C: 031] "BTW, I usually use mariadb:, we should standarize on one of the 2, but low priority as long as it is correct" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:18:28] (03CR) 10Banyek: [C: 032] db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:18:58] (03CR) 10Banyek: [C: 032] MariaDB: mute notifications of db1119 [puppet] - 10https://gerrit.wikimedia.org/r/460906 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:19:45] 10Operations, 10Parsing-Team, 10RESTBase, 10RESTBase-API, and 2 others: Improve Accept header normalization in VCL for REST API - https://phabricator.wikimedia.org/T202682 (10mobrovac) 05Open>03Resolved a:03Pchelolo Since node's semver package does not deal with `.x` versions well, we settled for for... [14:20:15] (03Merged) 10jenkins-bot: db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:21:34] 10Operations, 10DBA, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10jcrespo) > http://dbahire.com/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ Is 250K rows inserted per second f... 
[14:25:36] (03PS3) 10Jcrespo: Revert "mariadb: Disable notifications for db1071" [puppet] - 10https://gerrit.wikimedia.org/r/460901 [14:27:19] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T203565: depooling db1119 (duration: 00m 49s) [14:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:27] T203565: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 [14:27:55] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/460908 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [14:29:57] 10Operations, 10Maps-Sprint, 10Traffic, 10Maps (Tilerator), and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) This is now ready to be deployed to beta, but note that deploying to production is blocked on {T198622}. [14:29:59] (03CR) 10jenkins-bot: db-eqiad.php: depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460907 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [14:31:47] !log upgrading (kernel & mariadb) db1119 [14:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Disable notifications for db1071" [puppet] - 10https://gerrit.wikimedia.org/r/460901 (owner: 10Jcrespo) [14:36:15] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [14:40:29] (03CR) 10Volans: "compiler results seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/460908 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [14:40:55] (03PS2) 10Volans: cumin: fix puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/460908 (https://phabricator.wikimedia.org/T177385) [14:41:40] (03CR) 10Volans: [C: 032] cumin: fix puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/460908 (https://phabricator.wikimedia.org/T177385) (owner: 10Volans) [14:42:15] jynus: there is an unmerged patch of yours on puppetmaster [14:42:17] can I merge it? [14:42:39] (03PS1) 10Thcipriani: Gerrit: Add CoC and Privacy Policy to old UI [puppet] - 10https://gerrit.wikimedia.org/r/460914 (https://phabricator.wikimedia.org/T196835) [14:48:09] marostegui: Would sometime in the next hour or two be a good time to start writing the new actor tables (and stop writing old comment fields) on test wikis and mediawiki.org? Or should I wait for another time? [14:49:26] anomie: fine by me, I am expecting to logoff after the SRE meeting (in 2 hours) though, so I won't be monitoring much :-) [14:51:16] anomie: test wikis can happen any time [14:51:28] Just being sure [14:53:46] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:03] "Caught INT; exiting" [14:55:36] someone ran puppet and hit cntrl-c? [14:57:30] no issues on rerun [14:58:28] (03CR) 10Dzahn: "i agree. 
in a seemingly unrelated change i am doing this here: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460605/ to unblock R" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [14:58:56] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:59:15] (03PS2) 10Muehlenhoff: Add cumin2001 to network constants and tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/460322 [15:00:59] (03CR) 10Muehlenhoff: [C: 032] Add cumin2001 to network constants and tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/460322 (owner: 10Muehlenhoff) [15:02:35] Hmm, I never enabled those on Beta Cluster. I should do that first. [15:03:13] 10Operations, 10DBA, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10bmansurov) @jcrespo 250K rows/sec sounds great. Batch import speed per se is not too important — I just don't want to wait hours to load data up like I did in a lab... [15:05:39] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) [15:08:52] 10Operations, 10Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (10akosiaris) Can't reproduce this. https://grafana.wikimedia.org/dashboard/db/phabricator?orgId=1&from=1536592054293&to=1537196854293 is pointing out that apache was restarted today so this is probably why. If this is... [15:09:14] (03PS1) 10Anomie: Set actor and comment schema migrations for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460916 (https://phabricator.wikimedia.org/T166733) [15:09:42] (03CR) 10Anomie: [C: 032] "Deploying schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460916 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [15:10:38] ^ That commit summary lied. 
Should say "config change". [15:11:01] (03Merged) 10jenkins-bot: Set actor and comment schema migrations for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460916 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [15:11:04] (03CR) 10Anomie: "> Patch Set 1: Code-Review+2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460916 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [15:12:22] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) Updated HP that it still is giving me the same "recharging" message 4 days later. [15:12:52] (03CR) 10jenkins-bot: Set actor and comment schema migrations for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460916 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [15:14:10] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10Cmjohnson) @jcrespo I need to power this server off let me know when I can do this [15:15:22] 10Operations, 10ops-eqiad: Heating alerts on kafka1014 - https://phabricator.wikimedia.org/T204479 (10Cmjohnson) @elukey yes please stop the host and I will apply thermal paste [15:16:21] 10Operations, 10Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (10greg) [15:16:27] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10greg) [15:18:24] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10jcrespo) Shutting down service... 
[15:19:00] (03PS1) 10Jcrespo: mariadb: Depool db1105 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 [15:19:47] !log db1061 (may generate s6 lag) for hw maintenance [15:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:05] !log stop db1061 (may generate s6 lag) for hw maintenance [15:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:55] !log db1069 replacing disk in slot 7 [15:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:30] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10jcrespo) db1061 should be now fully down and ready for you! @Cmjohnson [15:28:29] (03PS1) 10Ema: cache_misc: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/460922 (https://phabricator.wikimedia.org/T164609) [15:29:38] robh: ready to take over clinic duty? last week was smooth and all the queues are empty (let me know if I forgot something) [15:29:50] i have topic open right htis second actually [15:29:51] heh [15:30:15] XioNoX: if there are sre meeting reviews you will list? [15:30:32] robh: yeah there is one [15:30:51] see the access request part of the etherpad [15:31:17] PROBLEM - Host db1061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:45] ^that is expected, as menined 2 logs ago [15:31:48] *mentioned [15:31:54] (03CR) 10Elukey: [C: 031] "LGTM (analytics side)" [puppet] - 10https://gerrit.wikimedia.org/r/460922 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [15:34:24] (03CR) 10Jcrespo: "Will probably leave this (and other P7510 hosts) for tomorrow and focus on db1061 instead. FYI." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 (owner: 10Jcrespo) [15:36:28] RECOVERY - Host db1061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [15:39:18] 10Operations, 10ops-eqiad, 10DBA: Degraded disk on db1069 (x1 master) - https://phabricator.wikimedia.org/T204462 (10Marostegui) Disk replaced by @Cmjohnson ``` root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:7] -aALL Rebuild Progress on Device at Enclosure 32, Slot 7 Completed 2% in 1 Minutes. ``` [15:44:28] PROBLEM - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:44:29] ACKNOWLEDGEMENT - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T204539 [15:44:34] 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T204539 (10ops-monitoring-bot) [15:44:54] 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T204539 (10Marostegui) [15:44:57] 10Operations, 10ops-eqiad, 10DBA: Degraded disk on db1069 (x1 master) - https://phabricator.wikimedia.org/T204462 (10Marostegui) [15:48:03] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10jcrespo) mgmt interface works for me again, waiting for you to be finished to continue with my maintenance. [15:49:25] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Cmjohnson) [15:50:35] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Cmjohnson) @ayounsi I connected both mx204's I have in eqiad to the console and mgmt switch. cr2-eqord is on port 47 and the other is in port 48 and labled cr2-eqsin. 
[15:52:06] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10Cmjohnson) [15:52:10] 10Operations, 10ops-eqiad: db1061 management interface busy (no sessions allowed) - https://phabricator.wikimedia.org/T204493 (10Cmjohnson) 05Open>03Resolved I am done [15:52:18] (03CR) 10Filippo Giunchedi: [C: 031] cache_misc: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/460922 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [15:53:10] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) sent HP an updated AHS log at their request [16:04:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:06:01] (03PS2) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458610 (https://phabricator.wikimedia.org/T191086) [16:06:18] (03PS3) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458610 (https://phabricator.wikimedia.org/T191086) [16:06:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:08:01] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10jcrespo) [16:08:47] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:02] ^ godog [16:09:15] womp womp [16:09:20] I'll take a look after the meeting [16:09:37] 10Operations, 10ops-eqiad: Degraded RAID on 
cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) Thank you for keeping touch with HP about this. This is dumb :( [16:10:48] RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 39.22 ms [16:11:59] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => WRITE_BOTH on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460923 (https://phabricator.wikimedia.org/T188327) [16:12:09] (03PS1) 10Anomie: Set CommentTableSchemaMigrationStage => WRITE_BOTH on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) [16:14:38] RECOVERY - Device not healthy -SMART- on db1069 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [16:15:02] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460923 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:15:26] 10Operations, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc1035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) [16:16:10] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10jcrespo) The plan is for us dbas to test setting up a single API with the same structure than eqiad and do all assuming th... [16:16:16] 10Operations, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc1035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Worth to note that after the switchover to codfw the problem still persist, and it is located in the same shard: ``` Se... 
[16:16:19] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => WRITE_BOTH on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460923 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:18:03] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting wgActorTableSchemaMigrationStage = WRITE_BOTH on test wikis, mw.org (T188327) (duration: 00m 50s) [16:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:10] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [16:18:25] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10mmodell) Should we raise the frequency of the restarts? [16:20:34] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) >>! In T202764#4589641, @jcrespo wrote: > The plan is for us dbas to test setting up a single API with the sam... 
[16:27:22] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => WRITE_BOTH on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460923 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:32:32] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:32:47] (03CR) 10Anomie: [C: 04-1] "Oops, wrong commit summary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:33:21] (03PS2) 10Anomie: Set CommentTableSchemaMigrationStage => WRITE_NEW on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) [16:33:44] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:34:56] (03Merged) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:35:15] 10Operations, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc1035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Krinkle do we have any plan to fix this from the mediawiki side? I am asking since I lost context in IRC between you an... [16:35:31] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10RobH) a:03mark This was not resolved in this weeks SRE meeting and instead will be reviewed directly by @mark. 
[16:35:34] 10Operations, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) [16:36:18] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = WRITE_NEW on test wikis, mw.org (T166733) (duration: 00m 50s) [16:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:27] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [16:43:04] (03CR) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460924 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:45:31] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Krenair) p:05Triage>03Normal [16:45:41] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Krenair) [16:55:21] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Krinkle) [16:56:45] 10Operations, 10ops-eqiad, 10DBA: Degraded disk on db1069 (x1 master) - https://phabricator.wikimedia.org/T204462 (10Marostegui) 05Open>03Resolved RAID rebuilt correctly: ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAI... [17:00:04] gehel: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180917T1700). 
[17:02:50] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1061.eqiad.wmnet'] ``` The log can be found in `/... [17:04:16] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10fgiunchedi) [17:04:17] (03PS1) 10Dzahn: add wikipediavideo.com as parked domain [dns] - 10https://gerrit.wikimedia.org/r/460927 [17:05:07] RECOVERY - MegaRAID on db1069 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:05:12] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10User-Elukey: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Krinkle) I can see that the errors in the logs are indication of a problem somewhere. But it's unclear to me wh... [17:05:23] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Krinkle) [17:10:21] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@6091de6]: Regular WDQS deployment [17:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:06] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@6091de6]: Regular WDQS deployment (duration: 11m 45s) [17:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:28] PROBLEM - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:27:48] PROBLEM - MariaDB Slave IO: s2 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect [17:28:17] PROBLEM - MariaDB read only s2 on dbstore2002 is CRITICAL: Could not connect to localhost:3312 [17:29:09] (03PS11) 10Bstorm: wiki replicas - prepare 
for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [17:29:18] RECOVERY - MariaDB read only s2 on dbstore2002 is OK: Version 10.1.35-MariaDB, Uptime 238s, read_only: True, 24.37 QPS, connection latency: 0.004549s, query latency: 0.000876s [17:29:59] (03CR) 10Bstorm: "Found some errors in local testing. Diffs with my local DB look correct." [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [17:31:14] (03PS12) 10Bstorm: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [17:38:19] 10Operations, 10DBA, 10Patch-For-Review: Upgrade all core (mediawiki) database servers to mariadb 10.1 - https://phabricator.wikimedia.org/T204311 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1061.eqiad.wmnet'] ``` and were **ALL** successful. [17:46:02] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:48:03] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:48:51] (03PS2) 10Dzahn: add wikipediavideo.com as parked domain [dns] - 10https://gerrit.wikimedia.org/r/460927 [17:54:20] (03PS1) 10BBlack: authdns-local-update: gdnsd-3.x compat [puppet] - 10https://gerrit.wikimedia.org/r/460937 [17:59:04] (03CR) 10Dzahn: [C: 032] add wikipediavideo.com as parked domain [dns] - 10https://gerrit.wikimedia.org/r/460927 (owner: 10Dzahn) [18:00:04] RoanKattouw and stephanebisson: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180917T1800). [18:00:04] stephanebisson, RoanKattouw, Ebe123, odder, and bmansurov: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] here [18:00:23] * Ebe123 is here (and ready) [18:00:24] * odder o/ [18:01:57] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Disable notifications on db1061" [puppet] - 10https://gerrit.wikimedia.org/r/460902 (owner: 10Jcrespo) [18:02:10] (03PS2) 10Jcrespo: Revert "mariadb: Disable notifications on db1061" [puppet] - 10https://gerrit.wikimedia.org/r/460902 [18:02:59] Hello [18:07:15] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 305 bytes in 0.016 second response time [18:07:58] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 271 bytes in 0.010 second response time [18:08:06] is that planned maintenance CC andrewbogott, bstorm_ ? [18:08:27] or unplanned? [18:08:28] yes, but it wasn't supposed to have impact...it is [18:08:36] sorry, can I help? [18:10:51] odder: Your change is on mwdebug1002. Can you test it? [18:11:04] stephanebisson: did you saw my email? [18:11:20] Stephanebisson, which test wiki? 
[18:11:26] *beta [18:11:30] jynus: No [18:11:37] about suggesting strongly to prefer mwdebug from codfw [18:11:46] stephanebisson: Yup, testing now [18:11:46] it is on ops- list [18:12:02] Oooh right because we switched [18:12:02] or you may have at times more errors than expected [18:12:18] Please read the ops- list about that, it is better explained there why [18:12:27] stephanebisson: Can confirm it looks like it's supposed to :-) [18:12:28] no breakage of any kind [18:12:36] but may give false positives [18:12:55] so I suggest using mw2017 or 99 instead [18:13:02] Maybe there should be a password to run scap that is sent out in random ops list postings [18:13:12] that way people will have to read :) [18:13:46] well, it may only have issues under a sum of circumstances [18:13:46] why does deploy2001 say "While it is perfectly working, this is not the active deployment server." if we're on codfw? [18:14:07] active deployment server is eqiad [18:14:16] but active dc is codfw [18:14:16] complexity :) [18:14:20] so better test that [18:14:54] !log sbisson@deploy1001 Synchronized static/images/project-logos: SWAT (duration: 00m 50s) [18:14:54] right now nothing is lagging so you should have no differences [18:15:17] odder: deployed [18:15:19] but I cannot assure that will be true all days all hours for the next 2 weeks [18:16:03] stephanebisson, which server should I use to test change 460938? [18:16:24] Ebe123: yours is not ready yet... waiting for jenkins. Will let you know. [18:18:37] stephanebisson: Hm, I still can't see it [18:18:43] stephanebisson: Did you do https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Purging ? 
[18:19:07] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 26.695 second response time [18:19:32] odder: no, will do [18:19:37] ^ people are looking at that [18:19:57] PROBLEM - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/secondary_cluster_showmount - 185 bytes in 0.305 second response time [18:20:07] (it's labs NFS, probably not a prod MW deployment blocker) [18:20:08] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:20:12] stephanebisson: Oh, okay, thanks :) [18:20:57] Hah, ssh'ing into mw2017 and mw2099 fails, but mwdebug2001 and mwdebug2002 work [18:21:09] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 356 bytes in 60.036 second response time [18:21:45] PROBLEM - Getent speed check on labstore1004 is CRITICAL: CRITICAL: getent group tools.admin failed [18:22:20] odder: I did the purges, let me know if it works as expected [18:23:03] It does perfectly, thanks very much [18:23:18] RECOVERY - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.043 second response time [18:23:32] More happy Wikipedians, success! \o/ [18:24:07] bmansurov: your change is on mw2017. Can you test? 
[18:24:22] stephanebisson: on it [18:24:55] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.748 second response time [18:25:19] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.189 second response time [18:25:28] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.014 second response time [18:26:18] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time [18:27:06] Ebe123: your change is on mw2017. Can you test? [18:27:25] RECOVERY - Getent speed check on labstore1004 is OK: OK: getent group returns within a second [18:28:39] Will do [18:28:48] RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.413 second response time [18:29:57] Much better! [18:30:39] stephanebisson: was there no train last week? [18:30:49] bmansurov: no [18:31:00] stephanebisson: ohh i see. [18:31:17] stephanebisson: let's not deploy then [18:31:35] bmansurov: ok, abort mission [18:31:40] stephanebisson: thanks! 
[18:31:49] (03PS1) 10Sbisson: Revert "Enable logging for CitationUsage and CitationUsagePageLoad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460944 [18:31:56] no train because dc switchover [18:31:58] (03CR) 10Sbisson: [C: 032] Revert "Enable logging for CitationUsage and CitationUsagePageLoad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460944 (owner: 10Sbisson) [18:32:29] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) @fgiunchedi I open a case with HP and for now They said there is no engineer available to help me, I will be receiving a call back in an hour. Please see below for case information. Dear Mr Papaul... [18:32:30] Ebe123: So, working as expected? [18:32:33] Yes [18:32:37] apergos: ok thanks. will it run this week? [18:32:44] bmansurov: yes [18:32:47] I imagine so [18:32:54] stephanebisson: apergos ok thanks! [18:33:33] (03Merged) 10jenkins-bot: Revert "Enable logging for CitationUsage and CitationUsagePageLoad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460944 (owner: 10Sbisson) [18:34:50] (03CR) 10jenkins-bot: Revert "Enable logging for CitationUsage and CitationUsagePageLoad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460944 (owner: 10Sbisson) [18:34:59] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Score/includes/Score.php: T203560 (duration: 00m 50s) [18:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:06] T203560: Notice: Undefined index: qb4tlxyr.ogg in /srv/mediawiki/php-1.32.0-wmf.19/extensions/Score/includes/Score.php on line 507 - https://phabricator.wikimedia.org/T203560 [18:35:11] bmansurov: the deployment calendar is the de-facto record of truth: https://wikitech.wikimedia.org/wiki/Deployments#Near-term [18:35:29] Ebe123: deployed everywhere [18:36:09] Thank you! 
[18:36:14] (03PS2) 10Sbisson: Enable PageTriage AfC on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458583 (https://phabricator.wikimedia.org/T203184) [18:36:34] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458583 (https://phabricator.wikimedia.org/T203184) (owner: 10Sbisson) [18:37:51] (03Merged) 10jenkins-bot: Enable PageTriage AfC on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458583 (https://phabricator.wikimedia.org/T203184) (owner: 10Sbisson) [18:40:56] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T203184 (duration: 00m 50s) [18:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:03] T203184: Deploy PageTriage AfC to production - https://phabricator.wikimedia.org/T203184 [18:43:55] greg-g: ok, thanks [18:46:59] !log Starting mwscript extensions/PageTriage/maintenance/populateDraftQueue.php --wiki enwiki [18:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:29] (03CR) 10jenkins-bot: Enable PageTriage AfC on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458583 (https://phabricator.wikimedia.org/T203184) (owner: 10Sbisson) [18:55:01] !log Stopped mwscript extensions/PageTriage/maintenance/populateDraftQueue.php --wiki enwiki [18:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:27] !log Starting mwscript extensions/PageTriage/maintenance/populateDraftQueue.php --wiki enwiki [18:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:48] (03PS9) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [19:02:11] (03PS1) 10Urbanecm: Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) [19:07:41] marostegui, hi, ad T204292) To understand your commend 
precisely, you're fine with having tables created, but you prefer not just enabling the extension until back on eqiad? Am I right? [19:07:42] T204292: Extension:Translate for id.wikimedia.org website - https://phabricator.wikimedia.org/T204292 [19:15:45] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Stop oversampling Asian countries - https://phabricator.wikimedia.org/T204365 (10Imarlier) p:05Triage>03Normal [19:19:08] 10Operations, 10Performance-Team, 10hardware-requests: eqiad: (1) misc single cpu server allocation for perfomance browser testing - https://phabricator.wikimedia.org/T204589 (10RobH) p:05Triage>03Normal [19:23:42] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10Reedy) [19:24:54] 10Operations, 10Performance-Team, 10hardware-requests: eqiad: (1) misc single cpu server allocation for perfomance browser testing - https://phabricator.wikimedia.org/T204589 (10Imarlier) Particularly with reference to the sudo question: we have previously tried using bare metal servers with a default debian... [19:30:30] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10sbassett) p:05Triage>03Normal [19:33:32] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 52676 MB (10% inode=99%) [19:38:33] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10Dzahn) > Can someone add @sbassett to security@ please? done, based on "new security engineer" > I think apalmer@wikimedia.org needs to be removed can OIT please trigger the regular offboarding workflow to notify us? 
[19:40:10] (03PS1) 10Herron: mx: enable gnutls %SERVER_PRECEDENCE in exim [puppet] - 10https://gerrit.wikimedia.org/r/460961 (https://phabricator.wikimedia.org/T203260) [19:40:59] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10Dzahn) 05Open>03Resolved a:03Dzahn [19:41:02] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [19:43:29] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) I am a bit confused by now - is the original problem because recentchanges is using a wrong host, or it's using... [19:45:36] 10Operations, 10Mail, 10Patch-For-Review, 10User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (10herron) >>! In T203260#4590940, @gerritbot wrote: > [operations/puppet@production] mx: enable gnutls %SERVER_PRECEDENCE in exim I'll plan to merge the above tomorrow barring... [19:57:12] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) ``` !log jenkins: remove compiler02.puppet3-diffs.eqiad.wmflabs and compiler03.puppet3-diffs.eqiad.wmflabs from jenkins confi... [19:58:39] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) 05Open>03Resolved a:03herron Only think left to do now is remove the old puppet3-diffs project which is being tracked in T204532.... 
[20:00:13] RECOVERY - Disk space on elastic1028 is OK: DISK OK [20:00:53] rip jouncebot [20:07:13] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460965 [20:07:15] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460965 (owner: 10Reedy) [20:09:22] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460965 (owner: 10Reedy) [20:09:56] I'm going to deploy ores [20:10:16] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 03m 26s) [20:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:47] 10Operations, 10Performance-Team, 10hardware-requests: eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (10Krinkle) [20:11:16] 10Operations, 10hardware-requests, 10Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (10Imarlier) [20:11:38] (03CR) 10Paladox: [C: 031] Gerrit: Add CoC and Privacy Policy to old UI [puppet] - 10https://gerrit.wikimedia.org/r/460914 (https://phabricator.wikimedia.org/T196835) (owner: 10Thcipriani) [20:12:34] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460965 (owner: 10Reedy) [20:13:17] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,pull_image,remove_container,run_podsandbox,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:14:17] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:15:47] !log ladsgroup@deploy1001 Started deploy [ores/deploy@ae96071]: PoolCounter support: Let's get the party started (T160692) [20:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:55] T160692: Use poolcounter to limit number of connections to ores uwsgi - https://phabricator.wikimedia.org/T160692 [20:17:59] I'm going to deploy MCS (aka mobileapps). [20:19:44] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@e0b7158]: Update mobileapps to d56e4cf [20:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:46] (03PS1) 10Andrew Bogott: region-migrate: copy an instance's security groups [puppet] - 10https://gerrit.wikimedia.org/r/460967 [20:21:28] (03CR) 10Andrew Bogott: [C: 032] region-migrate: copy an instance's security groups [puppet] - 10https://gerrit.wikimedia.org/r/460967 (owner: 10Andrew Bogott) [20:23:39] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Imarlier) a:03Krinkle [20:24:00] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@e0b7158]: Update mobileapps to d56e4cf (duration: 04m 16s) [20:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:01] Everything looks fine, moving to prod [20:30:54] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [20:33:00] akosiaris: I thought ores is active/active. 
it seems eqiad is not getting any traffic: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=1&fullscreen&orgId=1 [20:33:29] The 1 there is me requesting inside the node by curl :] [20:34:36] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1150 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:35:35] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10Imarlier) [20:37:53] (03PS1) 10Andrew Bogott: region-migrate: set the host ip in /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/460973 [20:39:19] (03CR) 10Andrew Bogott: [C: 032] region-migrate: set the host ip in /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/460973 (owner: 10Andrew Bogott) [20:40:47] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic: Pass on name of the node serving ORES requests as response header to the user - https://phabricator.wikimedia.org/T204600 (10Ladsgroup) [20:43:18] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) Engineer called at 2:20 PM CDT called ended at 2:37 PM CDT. He went over the AHS logs and didn't find any error or issue. His recommendation as I mentioned was to upgrade the BIOS and the controller.... 
[20:44:05] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@ae96071]: PoolCounter support: Let's get the party started (T160692) (duration: 28m 19s) [20:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:14] T160692: Use poolcounter to limit number of connections to ores uwsgi - https://phabricator.wikimedia.org/T160692 [20:45:19] It's using the lock manager [20:48:08] 04Critical Alert for device mr1-eqiad.wikimedia.org - Duplicate IP on mgmt network [20:48:57] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 5 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Imarlier) [20:49:14] 10Operations: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Reedy) [20:49:15] !log add email address to User:Lanhiaze [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:01] 10Operations: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Reedy) [20:50:35] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Imarlier) [20:51:05] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10MoritzMuehlenhoff) [20:52:36] The median of lock time for prod is half a millisecond <- awight [20:53:17] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10MoritzMuehlenhoff) I've added the "Datacenter-Switchover-2018" project as this was filed as a response to a question in the staff channel (where th... 
[20:56:01] !log andrew@deploy1001 Started deploy [horizon/deploy@3124052]: Disable unneeded network panels [20:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:35] !log andrew@deploy1001 Finished deploy [horizon/deploy@3124052]: Disable unneeded network panels (duration: 03m 34s) [20:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:47] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10Bawolff) >>! In T204590#4590919, @Dzahn wrote: >> Can someone add @sbassett to security@ please? > > done, based on "new security engineer" > >> I think apalmer@wikimedia.org needs to be removed > > can OIT please trigger... [21:19:27] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10GTirloni) While checking an email from Shinken about Puppet failing, I've noticed the following error. Adding it here in case it's related. ``` Sep 17 21:... 
[21:23:08] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-eqiad.wikimedia.org recovered from Duplicate IP on mgmt network [21:29:04] 10Operations, 10Release Pipeline, 10Epic, 10Release-Engineering-Team (Kanban), 10Services (watching): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10hashar) [21:32:02] !log andrew@deploy1001 Started deploy [horizon/deploy@3124052]: Cleaning up from some by-hand hacks [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:11] !log andrew@deploy1001 Finished deploy [horizon/deploy@3124052]: Cleaning up from some by-hand hacks (duration: 00m 10s) [21:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:20] !log andrew@deploy1001 Started deploy [horizon/deploy@3124052]: Fighting with scap [21:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:30] !log andrew@deploy1001 Finished deploy [horizon/deploy@3124052]: Fighting with scap (duration: 00m 10s) [21:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: On Trusty and Jessie PHP yields: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/20-xhprof.ini on line 2 - https://phabricator.wikimedia.org/T135338 (10hashar) 05Open>03Resolved a:03hashar xhprof is n... 
[21:42:36] !log andrew@deploy1001 Started deploy [horizon/deploy@3124052]: Fighting with scap more [21:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:26] !log andrew@deploy1001 Finished deploy [horizon/deploy@3124052]: Fighting with scap more (duration: 01m 50s) [21:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:24] (03PS7) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) [22:10:44] 10Operations: Add sbassett to security@ - https://phabricator.wikimedia.org/T204590 (10Dzahn) I checked on mx1001 and it looks like it has been re-created meanwhile. [22:14:09] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) @elukey @20after4 sorry for not replying earlier here. i think the only missing step to failover to phab1002 is that we set a main... [22:23:31] (03CR) 10Dzahn: Icinga: add check_bfd check (part 1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:27:26] (03CR) 10Dzahn: Icinga: add check_bfd check (part 1) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:28:55] (03CR) 10Dzahn: Icinga: add check_bfd check (part 1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:32:49] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Dzahn) You can contact the -owner address directly. For list specific admin questions there is no need for a ticket. In this case "engineering-owner@lists.wikimedia.org" is the email... 
[22:34:11] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) a:03Dzahn [22:40:20] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Aklapper) @dzahn: That won't help if the only admin is on an extended leave. Like in this case. :) (I'd also like to see another admin added due to recent spamming.) [22:46:49] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Dzahn) > puppet-mailman.puppet.eqiad.wmflabs I logged in here. i can see apparently only myself and @matanya used logged in besides root ( over 3 years ago) per: ``` root@puppet-mailman:~# lastl... [22:53:52] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Dzahn) I will add Quiddity but please not this means taking access away from Rachel effectively. There is only one password per list and only she knows the current one. [22:58:10] (03CR) 10Ayounsi: Icinga: add check_bfd check (part 1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [23:03:54] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Dzahn) I took the master password from pwstore and used that to login on the web UI for list admins. There i added Quiddity. Now in the footer of the listinfo page it says "Engineer... [23:04:43] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Dzahn) ..once Rachel gets back please share the new password.. or ask us to run that command again. 
[23:08:29] (03CR) 10Dzahn: [C: 031] "lgtm, just make sure after merge and puppet run the icinga config is still syntactically correct, with 'icinga -v /etc/icinga/icinga.cfg' " [puppet] - 10https://gerrit.wikimedia.org/r/370103 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [23:09:37] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Quiddity) 05Open>03Resolved a:03Quiddity Will do. Thank you :) (I've removed and filtered the spammer) [23:21:01] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Dzahn) > trusty-update lastlog and bash_history here looks like this was used by @Andrew for some puppet testing (vi /etc/puppet/puppet.conf in Oct 2017) and besides only @Muehlenhoff once logged... [23:22:36] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Dzahn) Both instances are shut down but not deleted yet. [23:23:01] 10Puppet, 10Cloud-VPS: cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Dzahn) a:03Dzahn [23:32:15] !log ms-be2042 - reparing xfs - (T199198) [23:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:23] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [23:42:07] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) @RobH @ssastry the change above should unblock this. It refactors the puppet code to profiles and ensures there is only a single role on the parsoid t... [23:45:12] (03CR) 10Dzahn: [C: 031] "compiler shows noop on prod parsoid and only resource name changes on test parsoid. 
https://puppet-compiler.wmflabs.org/compiler1002/12484" [puppet] - 10https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [23:46:44] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) coincidentally this should also help with (the comments on) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460064/ where it is discussed wheth...