[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T0000). Please do the needful.
[00:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:38] hello 4pm
[00:01:33] greg-g: it's 00:00
[00:01:49] 00:01.
[00:03:14] bd808: what you call it, it means something to my daily routine :)
[00:03:18] whatever*
[00:04:21] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:04:50] !log upgrade prometheus-varnish-exporter on cache boxes in codfw and eqiad - T150479
[00:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:02] T150479: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479
[00:06:51] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[00:09:31] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 5228.57 ms
[00:09:40] eh?
[00:09:46] google is down? :)
[00:09:51] RECOVERY - Host google is UP: PING WARNING - Packet loss = 16%, RTA = 1692.99 ms
[00:09:57] wb google
[00:12:50] there is a "virtual" host google in icinga because we have checks to make sure our main domains are not in [[w:Google Safe Browsing]] and they are associated with it.
this happens rarely but today at least twice
[00:13:30] interesting
[00:13:33] thanks mutante
[00:14:48] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/4813/" [puppet] - https://gerrit.wikimedia.org/r/322830 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[00:16:29] Operations, ChangeProp, Parsing-Team, Parsoid, and 5 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#2852935 (GWicke)
[00:16:41] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[00:17:19] (PS1) Tim Landscheidt: Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549)
[00:22:43] (PS2) Andrew Bogott: Novaobserver: novaobserver isn't in the admin project. [puppet] - https://gerrit.wikimedia.org/r/325643 (https://phabricator.wikimedia.org/T150092)
[00:32:21] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[00:32:50] (PS1) Dzahn: installserver: move firewall include to http/proxy classes [puppet] - https://gerrit.wikimedia.org/r/325720 (https://phabricator.wikimedia.org/T132757)
[00:32:57] hello. sorry i'm late. are we still swatting?
[00:33:11] …is anyone swatting?
[00:33:52] MatmaRex, I can
[00:34:10] (PS2) Dzahn: installserver: move firewall include to http/proxy classes [puppet] - https://gerrit.wikimedia.org/r/325720 (https://phabricator.wikimedia.org/T132757)
[00:34:24] thanks
[00:35:01] PROBLEM - carbon-cache@f service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed
[00:35:11] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:36:27] odd, taking a look
[00:36:33] godog: it says it was killed
[00:38:01] mutante: ah, bah, I've restarted it
[00:38:01] RECOVERY - carbon-cache@f service on graphite1003 is OK: OK - carbon-cache@f is active
[00:38:11] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[00:38:24] it's been generally stable carbon-cache but we might as well stick restart=always if it happens again
[00:38:30] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/4814/" [puppet] - https://gerrit.wikimedia.org/r/325720 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[00:38:41] godog: alright, yea
[00:40:57] (CR) jenkins-bot: [V: -1] Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549) (owner: Tim Landscheidt)
[00:43:17] MaxSem: ping me to verify when it's live
[00:43:37] MatmaRex, pulled on mwdebug1002
[00:45:19] MaxSem: works as expected on https://test.wikipedia.org/wiki/Special:UploadWizard (issue is only live in group0)
[00:45:33] cool
[00:48:19] !log maxsem@tin Synchronized php-1.29.0-wmf.5/extensions/UploadWizard: https://gerrit.wikimedia.org/r/#/c/325625/ (duration: 00m 46s)
[00:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:57] MatmaRex, ^
[00:48:57] thanks
[00:55:02] (PS2) Tim Landscheidt: Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549)
[00:56:33] (PS1) Dzahn: install: rename tftp_server role to just tftp and further cleanup [puppet] - https://gerrit.wikimedia.org/r/325725
[00:58:09] (PS2) Dzahn: install: rename tftp_server role to just tftp and further cleanup [puppet] - https://gerrit.wikimedia.org/r/325725 (https://phabricator.wikimedia.org/T132757)
[01:00:18] (CR) jenkins-bot: [V: -1] Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549)
(owner: Tim Landscheidt)
[01:01:09] !log dump debug and restart hhvm on mw1232
[01:01:11] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.072 second response time
[01:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:51] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 70770 bytes in 0.143 second response time
[01:24:07] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/4815/ (fails are unrelated problems of the compiler and something template parsing prometheus)" [puppet] - https://gerrit.wikimedia.org/r/325725 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[01:24:23] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2853133 (GWicke) >>! In T66214#2850204, @Ciencia_Al_Poder wrote: >>>! In T66214#2827486, @GWicke wrote: >> Since the need for explicit control should...
[01:26:48] (PS3) Dzahn: icinga: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325491 (owner: Tim Landscheidt)
[01:26:59] (CR) Dzahn: [C: 2 V: 2] icinga: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325491 (owner: Tim Landscheidt)
[01:27:28] (PS3) Dzahn: ipmi: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325492 (owner: Tim Landscheidt)
[01:27:58] (CR) Dzahn: [C: 2 V: 2] ipmi: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325492 (owner: Tim Landscheidt)
[01:35:12] (CR) Dzahn: [C: 1] "ugh, yea, let's not contribute to the monthly ddos of ieee.org :p this tries to download http://standards-oui.ieee.org/oui/oui.txt tons o" [puppet] - https://gerrit.wikimedia.org/r/325699 (https://phabricator.wikimedia.org/T152440) (owner: Filippo Giunchedi)
[01:37:01] Operations, media-storage, Patch-For-Review: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2848229 (Dzahn) The Debian bug says how it was pointless to download this all the time anyways a...
[01:41:51] Operations, Continuous-Integration-Config, Operations-Software-Development, Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2853177 (scfc) Thanks, @hashar. On closer look, these are indeed for the most part `flake8` errors. (I find the...
[01:51:45] (PS1) Dzahn: install: (re)move remaining "role::installserver" [puppet] - https://gerrit.wikimedia.org/r/325728 (https://phabricator.wikimedia.org/T132757)
[01:58:12] (PS1) Tim Landscheidt: Move wikimedia-logo.svg to role module [puppet] - https://gerrit.wikimedia.org/r/325729
[01:59:55] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/4816/" [puppet] - https://gerrit.wikimedia.org/r/325728 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[02:26:42] (PS3) Dzahn: dataset: Fix puppet URLs in comments [puppet] - https://gerrit.wikimedia.org/r/325472 (owner: Tim Landscheidt)
[02:50:21] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:54:51] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:58:53] !log upload prometheus-node-exporter 0.13.0~rc.2 to carbon - T152580
[02:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:06] T152580: rollout prometheus-node-exporter 0.13 - https://phabricator.wikimedia.org/T152580
[03:00:39] !log bounce uwsgi-graphite-web on graphite1003, using a lot of memory
[03:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:39] gehel: when you get a chance I had a question on T145659 re: elasticsearch metrics still in ganglia, https://phabricator.wikimedia.org/P4571 could you take a look ?
[03:16:39] T145659: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659
[03:19:21] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:22:51] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[03:25:21] Operations, Parsing-Team, uprightdiff, Patch-For-Review: Debian packaging for uprightdiff - https://phabricator.wikimedia.org/T152577#2853242 (Legoktm) + #operations to help with review and then uploading of the package. For reference, currently it's being manually built on the visualdiff testin...
[04:11:01] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[04:37:01] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:59:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]
[05:37:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave]
[05:45:33] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2853347 (Dzahn) With the last merges the "role installserver" is now history.
It has been split into "dhcp", "http", "preseed", "proxy" and "tftp" all in modules/role/mani...
[05:53:51] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:00:31] (PS1) Dzahn: install: add 'preseed'-role to install1001 [puppet] - https://gerrit.wikimedia.org/r/325737 (https://phabricator.wikimedia.org/T132757)
[06:01:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[06:07:35] (PS1) Dzahn: install: add http & proxy roles on install1001 [puppet] - https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757)
[06:11:21] Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#931465 (Dzahn) The traditional "installserver role" that did everything is gone since today. i split it into "dhcp", "http", "preseed", "proxy" and "tftp" all in modules/role/manifests/installserver/ that should all be fr...
[06:12:05] (PS1) Gergő Tisza: Whitelist TSG for account creation [mediawiki-config] - https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588)
[06:12:54] (CR) jenkins-bot: [V: -1] Whitelist TSG for account creation [mediawiki-config] - https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588) (owner: Gergő Tisza)
[06:13:55] Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#2853410 (Dzahn) ``` node 'carbon.wikimedia.org' { role(installserver::tftp, installserver::dhcp, installserver::http, installserver::proxy, installserver::preseed, aptrepo::wiki...
[06:14:53] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2853414 (Dzahn) ``` node 'carbon.wikimedia.org' { role(installserver::tftp, installserver::dhcp, installserver::http, installserver::proxy,...
[06:17:27] (PS2) Gergő Tisza: Whitelist TSG for account creation [mediawiki-config] - https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588)
[06:17:43] (CR) Gergő Tisza: "Oh nice, mixed case option keys." [mediawiki-config] - https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588) (owner: Gergő Tisza)
[06:20:05] (PS1) Dzahn: install: add http & proxy roles on install2001 [puppet] - https://gerrit.wikimedia.org/r/325743 (https://phabricator.wikimedia.org/T132757)
[06:21:51] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:23:38] (CR) Dzahn: [C: 1] delete keys in files/ppa/ [puppet] - https://gerrit.wikimedia.org/r/318451 (owner: Dzahn)
[06:23:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[06:24:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4755554 keys, up 36 days 22 hours - replication_delay is 46
[06:29:21] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=628.30 Read Requests/Sec=349.70 Write Requests/Sec=748.10 KBytes Read/Sec=44530.40 KBytes_Written/Sec=7057.60
[06:30:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:30:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:30:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces
up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
[06:38:21] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.00 Read Requests/Sec=0.20 Write Requests/Sec=0.40 KBytes Read/Sec=3.60 KBytes_Written/Sec=15.60
[06:42:31] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[screen]
[07:10:21] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:20:19] (PS2) Marostegui: mariadb: Added gtid_domain_id variable [puppet] - https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418)
[07:22:19] (CR) Marostegui: [C: 2] mariadb: Added gtid_domain_id variable [puppet] - https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[07:25:40] (CR) Elukey: Initial debianization (1 comment) [debs/prometheus-apache-exporter] - https://gerrit.wikimedia.org/r/325568 (https://phabricator.wikimedia.org/T147316) (owner: Elukey)
[07:26:37] (PS2) Elukey: Initial debianization [debs/prometheus-apache-exporter] - https://gerrit.wikimedia.org/r/325568 (https://phabricator.wikimedia.org/T147316)
[07:28:40] Operations, ops-codfw, DBA: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2853484 (Marostegui) Open>Resolved All good now - thanks Papaul! ``` hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F6F50) Gen8 ServBP 12+2 at P...
[07:31:28] Operations, ChangeProp, Parsing-Team, Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2853486 (Joe) >>! In T152074#2851136, @GWicke wrote: >> the issue we're seeing here is excessive request rate...
[08:09:12] Operations, ops-codfw, DBA: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2853507 (Marostegui) Open>Resolved This is good now - thanks Papaul! ``` root@db2042:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 001438031205F10)...
[08:14:46] (PS1) Marostegui: db-eqiad.php: Depool db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/325745 (https://phabricator.wikimedia.org/T148967)
[08:24:40] !log Deploy ALTER table db2023 (codfw master) wikidatawiki.revision - T150644
[08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:53] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[08:28:49] (CR) Jcrespo: [C: 1] "Good call. Check the processlist, normally both vslow and dump take a long time to depool and we may be on dumping times." [mediawiki-config] - https://gerrit.wikimedia.org/r/325745 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[08:29:26] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/325745 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[08:29:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/325745 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[08:30:09] (PS4) Gehel: contint: Install php7.0-ast for phan [puppet] - https://gerrit.wikimedia.org/r/315711 (https://phabricator.wikimedia.org/T132636) (owner: Legoktm)
[08:32:41] (CR) Gehel: [C: 2] contint: Install php7.0-ast for phan [puppet] - https://gerrit.wikimedia.org/r/315711 (https://phabricator.wikimedia.org/T132636) (owner: Legoktm)
[08:32:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 - T148967 (duration: 02m 10s)
[08:33:11] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:12] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[08:37:18] !log Stop MySQL db2048 for maintenance - T149553
[08:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:29] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[08:38:31] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:40] (PS1) Jcrespo: admin-jynus-dotfiles: Add some improvements on quality of life [puppet] - https://gerrit.wikimedia.org/r/325746
[08:42:19] (PS2) Jcrespo: admin-jynus-dotfiles: Add some improvements on quality of life [puppet] - https://gerrit.wikimedia.org/r/325746
[08:56:34] (PS4) ArielGlenn: dataset: Fix puppet URLs in comments [puppet] - https://gerrit.wikimedia.org/r/325472 (owner: Tim Landscheidt)
[08:57:48] (CR) ArielGlenn: [C: 2] dataset: Fix puppet URLs in comments [puppet] - https://gerrit.wikimedia.org/r/325472 (owner: Tim Landscheidt)
[08:58:44] !log Deploy ALTER table db1045 dewiki.revision - T148967
[08:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:56] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[09:00:15] morning :D
[09:00:25] morning elukey :-)
[09:01:08] (Abandoned) ArielGlenn: fix up locking for misc dumps [dumps] - https://gerrit.wikimedia.org/r/308016 (owner: ArielGlenn)
[09:03:01] (Abandoned) ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) (owner: ArielGlenn)
[09:04:12] (Abandoned) ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - https://gerrit.wikimedia.org/r/308015 (owner: ArielGlenn)
[09:06:31]
RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:07:21] Operations, Patch-For-Review, Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2853580 (Gehel)
[09:08:04] Operations, Patch-For-Review, Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2637292 (Gehel) There is no reason to duplicate elasticsearch metrics in both graphite and prometheus. Let's just not port those metrics.
[09:19:13] Operations, DBA: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853592 (jcrespo)
[09:19:36] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853604 (jcrespo)
[09:19:50] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853592 (jcrespo)
[09:19:53] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2853606 (jcrespo)
[09:19:55] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853607 (Marostegui) Should we merge these two: https://phabricator.wikimedia.org/T152427 ?
[09:20:51] Operations, DBA, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2853611 (jcrespo)
[09:20:53] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2853608 (jcrespo) Open>Resolved a:jcrespo The ongoing issues are now resolved.
Long term fixes will go on T152188.
[09:21:09] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2853612 (jcrespo) a:jcrespo>Marostegui
[09:24:25] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853615 (jcrespo) Sorry, I didn't see that one, my fault entirely, but you should have added me as subscriber.
[09:25:14] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853616 (Marostegui) Ah sorry - I thought that by adding the project DBA it would add you automatically. My bad!
[09:25:18] Operations, DBA, Monitoring: Implement TLS expiration/validation checking for MariaDB certificates - https://phabricator.wikimedia.org/T152595#2853621 (jcrespo)
[09:25:41] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2853623 (jcrespo)
[09:26:04] Operations, DBA, Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2847847 (jcrespo)
[09:27:18] (PS3) Yuvipanda: kubelet: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324210 (owner: Alexandros Kosiaris)
[09:27:36] Operations, DBA, Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2853628 (jcrespo)
[09:31:26] (PS3) Yuvipanda: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211 (owner: Alexandros Kosiaris)
[09:33:57] (PS4) Jcrespo: Renew expired TLS certificate for eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364)
[09:44:09] (CR) Jcrespo: [C: 2] Renew expired TLS certificate for
eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364) (owner: Jcrespo)
[09:53:03] (PS1) Yuvipanda: dynamicproxy: Bind on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/325751
[09:53:19] (CR) jenkins-bot: [V: -1] dynamicproxy: Bind on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/325751 (owner: Yuvipanda)
[09:54:05] (PS2) Yuvipanda: dynamicproxy: Bind on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/325751
[09:55:20] (CR) Alexandros Kosiaris: [C: 1] dynamicproxy: Bind on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/325751 (owner: Yuvipanda)
[09:55:45] (CR) Yuvipanda: [C: 2] dynamicproxy: Bind on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/325751 (owner: Yuvipanda)
[10:01:10] (PS4) Alexandros Kosiaris: kubelet: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324210
[10:01:12] (PS4) Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211
[10:01:14] (PS4) Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - https://gerrit.wikimedia.org/r/324212
[10:01:16] (PS3) Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/324213
[10:03:42] !log restart and upgrade of dbstore1001 T152188
[10:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:54] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[10:06:53] (PS3) ArielGlenn: Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549) (owner: Tim Landscheidt)
[10:08:22] Operations, Continuous-Integration-Config, Operations-Software-Development, Patch-For-Review: tox-jessie is failing on operations/software -
https://phabricator.wikimedia.org/T152549#2853664 (ArielGlenn) I'm taking a look.
[10:12:58] (CR) ArielGlenn: [C: 2] Fix flake8 errors [software] - https://gerrit.wikimedia.org/r/325715 (https://phabricator.wikimedia.org/T152549) (owner: Tim Landscheidt)
[10:15:04] Operations, Continuous-Integration-Config, Operations-Software-Development, Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2853671 (ArielGlenn) Looks like that fixed it.
[10:19:20] Operations, Continuous-Integration-Config, Operations-Software-Development, Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2853674 (hashar) https://gerrit.wikimedia.org/r/#/c/325715/ fix flake8 newly introduced lint E305. The other erro...
[10:20:31] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:36:38] (CR) Alex Monk: [C: -1] "see task" [mediawiki-config] - https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588) (owner: Gergő Tisza)
[10:40:40] (PS5) Alexandros Kosiaris: kubelet: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324210
[10:40:42] (PS5) Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211
[10:40:44] (PS5) Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - https://gerrit.wikimedia.org/r/324212
[10:40:46] (PS4) Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/324213
[10:41:21] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[10:49:31] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[10:50:23] (CR) Yuvipanda: [C: -1] Kube-proxy: Amend to support more than labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/324211 (owner: Alexandros Kosiaris)
[10:52:32] (PS6) Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211
[10:52:34] (PS6) Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - https://gerrit.wikimedia.org/r/324212
[10:52:36] (PS5) Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/324213
[11:03:18] (PS7) Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211
[11:03:20] (PS7) Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - https://gerrit.wikimedia.org/r/324212
[11:03:22] (PS6) Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/324213
[11:03:24] (CR) Alex Monk: [C: 1] Novaobserver: novaobserver isn't in the admin project.
[puppet] - https://gerrit.wikimedia.org/r/325643 (https://phabricator.wikimedia.org/T150092) (owner: Andrew Bogott)
[11:09:21] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[11:11:11] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.169 second response time
[11:13:11] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.102 second response time
[11:17:45] (PS1) Jcrespo: Fixes to the predump and bpipe mysql method of backups [puppet] - https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188)
[11:23:23] (CR) Alex Monk: [C: 1] mediawiki: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325476 (owner: Tim Landscheidt)
[11:23:53] (PS6) Alexandros Kosiaris: kubelet: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324210
[11:23:55] (PS8) Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211
[11:23:57] (PS8) Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - https://gerrit.wikimedia.org/r/324212
[11:23:59] (PS7) Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/324213
[11:25:14] (CR) Marostegui: [C: 1] Fixes to the predump and bpipe mysql method of backups [puppet] - https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) (owner: Jcrespo)
[11:25:17] (Abandoned) Alexandros Kosiaris: puppetmaster: remove hiera for the labtest realm [puppet] - https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: Faidon
Liambotis) [11:25:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As mentioned on IRC, we should not use the public certs here, but use ones signed by the puppet CA. Working on it :)" [puppet] - 10https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) (owner: 10Giuseppe Lavagetto) [11:28:20] (03CR) 10Alexandros Kosiaris: [C: 032] "works in the tools environment, tested with Yuvi, merging" [puppet] - 10https://gerrit.wikimedia.org/r/324210 (owner: 10Alexandros Kosiaris) [11:28:35] (03CR) 10Alexandros Kosiaris: [C: 032] "works in the tools environment, tested with Yuvi, merging" [puppet] - 10https://gerrit.wikimedia.org/r/324211 (owner: 10Alexandros Kosiaris) [11:30:20] (03CR) 10Alex Monk: openstack: Fix puppet URLs in comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [11:31:24] (03PS1) 10Jcrespo: mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 [11:32:05] (03CR) 10Alex Monk: "I'd prefer to have stewards manage these." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325531 (https://phabricator.wikimedia.org/T152489) (owner: 10MarcoAurelio) [11:32:32] (03CR) 10jenkins-bot: [V: 04-1] mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 (owner: 10Jcrespo) [11:32:57] (03PS2) 10Jcrespo: mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 [11:37:58] (03CR) 10Marostegui: [C: 031] mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 (owner: 10Jcrespo) [11:39:06] (03CR) 10Jcrespo: [C: 031] "The predump fix is a blocker for backups to work again (they failed since 1 December due to TLS expiration + new configuration changes)." 
[puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:43:58] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2853756 (10ArielGlenn) @demon, as much as I hate submodules, that might be a good idea anyways. I don't know if it makes a huge amount of difference whether we have the e... [11:52:07] (03PS1) 10Hashar: Use local tox instead of installing a new one [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) [11:55:05] (03CR) 10Hashar: "This way ones consistently uses the local tox installation instead of latest from pypi." [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [12:04:32] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:48] (03PS1) 10Jcrespo: mariadb: allow colorized mysql output [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/325763 [12:11:37] (03CR) 10Marostegui: [C: 031] "+1000 and it is time to update my dotfiles to include --pager by default!" 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/325763 (owner: 10Jcrespo) [12:14:29] (03PS1) 10Marostegui: mariadb: Move eventlogging class to a single file [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) [12:17:05] (03CR) 10Jcrespo: [C: 032] mariadb: allow colorized mysql output [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/325763 (owner: 10Jcrespo) [12:19:42] (03CR) 10Marostegui: [C: 04-2] "fix this first: https://puppet-compiler.wmflabs.org/4817/db1046.eqiad.wmnet/change.db1046.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) (owner: 10Marostegui) [12:22:23] (03PS3) 10Jcrespo: admin-jynus-dotfiles: Add some improvements on quality of life [puppet] - 10https://gerrit.wikimedia.org/r/325746 [12:27:51] (03PS4) 10Jcrespo: admin-jynus-dotfiles: Add some improvements on quality of life [puppet] - 10https://gerrit.wikimedia.org/r/325746 [12:27:54] (03PS2) 10Marostegui: mariadb: Move eventlogging class to a single file [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) [12:28:26] (03CR) 10Marostegui: [C: 031] "I think I will copy those :-)" [puppet] - 10https://gerrit.wikimedia.org/r/325746 (owner: 10Jcrespo) [12:28:51] (03CR) 10Jcrespo: [C: 032] admin-jynus-dotfiles: Add some improvements on quality of life [puppet] - 10https://gerrit.wikimedia.org/r/325746 (owner: 10Jcrespo) [12:32:31] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:33:41] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:11] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[12:39:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 604 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4769434 keys, up 37 days 4 hours - replication_delay is 604 [12:41:55] (03PS3) 10Marostegui: mariadb: Move eventlogging class to a single file [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) [12:42:53] (03PS1) 10ArielGlenn: generate separate mysql config with list of private wikis [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/325765 (https://phabricator.wikimedia.org/T152100) [12:49:10] (03PS1) 10ArielGlenn: for sanitarium hosts, include a separate mysql cnf with private wikis [puppet] - 10https://gerrit.wikimedia.org/r/325766 (https://phabricator.wikimedia.org/T152100) [12:50:31] (03PS3) 10BBlack: varnish: make PURGE more efficient [puppet] - 10https://gerrit.wikimedia.org/r/324270 [12:51:35] (03CR) 10BBlack: [C: 032 V: 032] varnish: make PURGE more efficient [puppet] - 10https://gerrit.wikimedia.org/r/324270 (owner: 10BBlack) [12:52:11] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [12:53:01] (03PS4) 10Marostegui: mariadb: Move eventlogging class to a single file [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) [12:53:51] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/jynus/.profile] [12:54:01] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/jynus/.profile] [12:54:30] (03PS1) 10Jcrespo: Revert "admin-jynus-dotfiles: Add some improvements on quality of life" [puppet] - 10https://gerrit.wikimedia.org/r/325767 [12:54:51] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/jynus/.profile] [12:54:54] (03CR) 10Jcrespo: [C: 032 V: 032] Revert "admin-jynus-dotfiles: Add some improvements on quality of life" [puppet] - 10https://gerrit.wikimedia.org/r/325767 (owner: 10Jcrespo) [12:55:02] (03PS2) 10Jcrespo: Revert "admin-jynus-dotfiles: Add some improvements on quality of life" [puppet] - 10https://gerrit.wikimedia.org/r/325767 [12:55:05] (03CR) 10Jcrespo: [V: 032] Revert "admin-jynus-dotfiles: Add some improvements on quality of life" [puppet] - 10https://gerrit.wikimedia.org/r/325767 (owner: 10Jcrespo) [12:56:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325768 [12:57:20] (03PS3) 10BBlack: varnish: better frontend mem sizing [puppet] - 10https://gerrit.wikimedia.org/r/324230 [12:57:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325768 (owner: 10Marostegui) [12:58:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325768 (owner: 10Marostegui) [12:58:30] (03PS6) 10BBlack: rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) [12:58:54] (03CR) 10BBlack: [C: 032 V: 032] rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [12:59:30] (03CR) 10Marostegui: "This now works: https://puppet-compiler.wmflabs.org/4822/" [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) (owner: 10Marostegui) [12:59:32] (03PS4) 10BBlack: misc: get rid of hash support and maintenance [puppet] - 10https://gerrit.wikimedia.org/r/324941 [12:59:36] (03CR) 10BBlack: [C: 032 V: 032] misc: get rid of hash support and maintenance [puppet] - 
10https://gerrit.wikimedia.org/r/324941 (owner: 10BBlack) [12:59:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 - T148967 (duration: 00m 56s) [13:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:01:13] (03PS3) 10BBlack: simplify security_audit backend for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/324947 [13:01:18] (03CR) 10BBlack: [C: 032 V: 032] simplify security_audit backend for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/324947 (owner: 10BBlack) [13:01:41] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:03:31] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag not a slave [13:03:51] RECOVERY - MariaDB Slave SQL: s1 on db2034 is OK: OK slave_sql_state not a slave [13:04:20] (03PS8) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [13:06:45] (03PS12) 10BBlack: VCL app_directors 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [13:06:47] (03PS9) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [13:06:49] (03PS12) 10BBlack: VCL app_directors refactor 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [13:11:43] (03PS5) 10Marostegui: mariadb: Split eventlogging, misc, monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) [13:21:01] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:21:51] RECOVERY - puppet last run on ms-be1024 is 
OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:22:51] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:23:51] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:32:24] (03CR) 10Jcrespo: [C: 04-1] "I think we should organize a bit things into a mariadb::common before splitting more things. For example, we do not want to repeat the sam" [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) (owner: 10Marostegui) [13:34:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4755906 keys, up 37 days 5 hours - replication_delay is 0 [13:36:06] (03PS1) 10Jcrespo: Revert "Revert "admin-jynus-dotfiles: Add some improvements on quality of life"" [puppet] - 10https://gerrit.wikimedia.org/r/325769 [13:36:13] (03PS2) 10Jcrespo: Revert "Revert "admin-jynus-dotfiles: Add some improvements on quality of life"" [puppet] - 10https://gerrit.wikimedia.org/r/325769 [13:37:03] aude_: do you want to run eu swat today? since you have a patch [13:37:16] asking for a friend ;) [13:38:21] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:39:32] i could [13:40:00] * aude looks at the patches [13:40:01] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 22 failures. Last run 2 minutes ago with 22 failures. 
Failed resources (up to 3 shown): Package[coreutils],Package[quickstack],Service[puppet],Service[rsyslog] [13:40:25] (03PS3) 10Jcrespo: Revert "Revert "admin-jynus-dotfiles: Add some improvements on quality of life"" [puppet] - 10https://gerrit.wikimedia.org/r/325769 [13:41:52] (03CR) 10Hashar: [C: 04-1] "Guess you want to put the key in the private git repo and use secret() to have puppet to retrieve it :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [13:41:57] (03CR) 10Jcrespo: [C: 032] Revert "Revert "admin-jynus-dotfiles: Add some improvements on quality of life"" [puppet] - 10https://gerrit.wikimedia.org/r/325769 (owner: 10Jcrespo) [13:45:31] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/jynus/.profile] [13:49:15] ^temporary failure [13:49:32] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:50:09] jouncebot: next [13:50:09] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T1400) [13:51:33] (03CR) 10Alexandros Kosiaris: Citoid: Add the wskey parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [13:52:11] hi hashar [13:52:51] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:53:37] (03PS6) 10Hashar: Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [13:54:40] aude: I have blindly rebased the patch to enable Wikibase #statements. 
Hopefully rebase is fine, but we never know :} [13:54:50] aude: looks straightforward [13:56:00] zeljkof: for today swat I have +2ed all the mediawiki related back ports and rebased the single mediawiki-config change [13:56:41] hashar: cool [13:56:47] aude: are you running the swat? [13:57:34] hashar: not sure who to ping about operations/puppet patch https://gerrit.wikimedia.org/r/#/c/324203/ [13:57:48] should I just post the link here and ask for reviews? [13:58:51] i can [13:59:24] (03PS2) 10Mobrovac: Citoid: Add the wskey parameter [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T1400). Please do the needful. [14:00:04] aharoni, dcausse, and aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:08] i can start with aharoni's patch [14:00:15] o/ [14:00:22] I CR+2 the mediawiki ones [14:00:36] they are in the CI gate still :/ [14:00:47] Shalom [14:00:48] ok [14:00:55] (03CR) 10Mobrovac: Citoid: Add the wskey parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [14:01:17] aharoni: alikoum salam :D [14:01:28] these are only needed for wmf5? [14:01:36] aharoni: dcausse ? [14:01:51] aude: cirrussearch patch is only for wmf5 yes [14:01:55] ok [14:03:26] Come to think of it, my patch is for Content Translation, which is only on Wikipedias, and I backported to wmf.5, and no Wikipedia has it yet. Which is OK, because it will be deployed to Hebrew and Catalan in a few hours, but I'm not sure I'll be able to test in production now. [14:03:56] maybe on test.wikipedia?
[14:05:11] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1335 [14:05:25] (03PS4) 10Zfilipin: ChromeDriver should be in PATH for jobs that run Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) [14:05:57] !log mobrovac@tin Starting deploy [changeprop/deploy@1c7c522]: (no message) [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:21] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:06:32] (03CR) 10Alexandros Kosiaris: [C: 031] "T1084!!! Nice. I'll populate ops/private and then merge this" [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [14:06:46] !log mobrovac@tin Finished deploy [changeprop/deploy@1c7c522]: (no message) (duration: 00m 48s) [14:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:08:03] aude - won't work, because it goes to the wiki in the language into which I translated, and test is not a language. 
[14:08:04] (03CR) 10Alexandros Kosiaris: [C: 032] ChromeDriver should be in PATH for jobs that run Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) (owner: 10Zfilipin) [14:08:27] we really gotta look at the MediaWiki test suite [14:08:41] ok :/ [14:08:44] those jobs are wayyyy too slow [14:10:11] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 1983042 Threads: 50 Questions: 285661304 Slow queries: 18705 Opens: 11083 Flush tables: 1 Open tables: 589 Queries per second avg: 144.052 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:11:07] i am getting the changes on mwdebug1001 though that's probably useless in this case [14:11:53] it's really slow to do scap pull there [14:11:57] for some reason [14:12:04] now it's ok [14:12:43] (03CR) 10Alexandros Kosiaris: [C: 032] Citoid: Add the wskey parameter [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [14:12:49] (03PS3) 10Alexandros Kosiaris: Citoid: Add the wskey parameter [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [14:12:51] (03CR) 10Alexandros Kosiaris: [V: 032] Citoid: Add the wskey parameter [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) (owner: 10Mobrovac) [14:13:55] now deploying to everywhere [14:14:37] !log aude@tin Synchronized php-1.29.0-wmf.5/extensions/ContentTranslation: Fix inline template editor (duration: 00m 50s) [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:14] now dcausse's patch [14:15:17] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2854015 (10akosiaris) @Cmjohnson Any news on this ? [14:16:01] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[grc] [14:16:41] dcausse: is this something that can be tested on mwdebug1001? [14:17:09] aude: I could if I can see the logs on this node [14:17:36] or at least we can check that search results are not totally broken [14:17:52] something with a redirect [14:18:09] aude: yes, will test a search [14:18:43] looks okay to me [14:19:04] aude: me too [14:19:05] !log restart and upgrade of dbstore200[12] T152188 [14:19:06] and the change makes sense [14:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [14:20:41] !log aude@tin Synchronized php-1.29.0-wmf.5/extensions/CirrusSearch: Fix undefined property (duration: 01m 00s) [14:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] (03Abandoned) 10Ladsgroup: First flake8 pass on LDAP [puppet] - 10https://gerrit.wikimedia.org/r/278271 (owner: 10Ladsgroup) [14:21:57] aude: thanks! [14:22:09] will check if the notice stop spamming logs [14:22:16] (03Abandoned) 10Ladsgroup: Flake8 on rolematcher [puppet] - 10https://gerrit.wikimedia.org/r/279148 (owner: 10Ladsgroup) [14:22:26] now the wikibase config change [14:22:51] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:23:16] (03CR) 10Aude: [C: 032] Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [14:24:01] (03Merged) 10jenkins-bot: Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [14:26:20] looks good on mwdebug1001 [14:27:37] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Enable statements parser function and lua on test wikis (duration: 00m 50s) [14:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] done :) [14:29:10] `\O/` [14:31:37] (03CR) 10MarcoAurelio: "> Is ipblock-exempt all that common on private wikis?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325531 (https://phabricator.wikimedia.org/T152489) (owner: 10MarcoAurelio) [14:32:37] aude, hashar — is my CX thing deployed? should I now just wait patiently until the traing deployment of wmf.5? [14:32:50] (03PS1) 10Marostegui: mariadb: Enable gtid_domain_id - phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) [14:34:19] (03PS1) 10Mobrovac: Citoid: Change the default for wskey [puppet] - 10https://gerrit.wikimedia.org/r/325782 [14:35:31] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [14:37:27] (03CR) 10Mobrovac: [C: 031] "cherry-picked in beta, works" [puppet] - 10https://gerrit.wikimedia.org/r/325782 (owner: 10Mobrovac) [14:39:02] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/4824/" [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:39:45] is someone doing maintenance on the ubuntu repository? [14:40:03] aharoni: yes [14:40:16] please check after the train [14:42:39] (03CR) 10Jcrespo: "Question- shouldn't we use a gtid_domain_id variable, in case we want to change it on puppet in the future?" [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:44:33] (03CR) 10Alexandros Kosiaris: [C: 032] Citoid: Change the default for wskey [puppet] - 10https://gerrit.wikimedia.org/r/325782 (owner: 10Mobrovac) [14:45:51] (03PS3) 10Andrew Bogott: Novaobserver: novaobserver isn't in the admin project. [puppet] - 10https://gerrit.wikimedia.org/r/325643 (https://phabricator.wikimedia.org/T150092) [14:48:46] (03CR) 10Andrew Bogott: [C: 032] Novaobserver: novaobserver isn't in the admin project. 
[puppet] - 10https://gerrit.wikimedia.org/r/325643 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [14:50:03] (03PS2) 10Andrew Bogott: Horizon: refresh apache anytime django is refreshed [puppet] - 10https://gerrit.wikimedia.org/r/325692 [14:50:51] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:50:56] it seemed a temporary upstream issue only [14:51:27] (03CR) 10Andrew Bogott: [C: 032] Horizon: refresh apache anytime django is refreshed [puppet] - 10https://gerrit.wikimedia.org/r/325692 (owner: 10Andrew Bogott) [14:52:01] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:56:42] (03CR) 10Marostegui: "We could, but I wouldn't like to have a server_id value different from a gtid_domain_id, for consistency mainly." [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:57:32] !log mobrovac@tin Starting deploy [citoid/deploy@be710c7]: deploying WorldCat API support [14:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:15] !log mobrovac@tin Finished deploy [citoid/deploy@be710c7]: deploying WorldCat API support (duration: 00m 43s) [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:31] (03CR) 10Jcrespo: "Nobody argues that, but enforce it on puppet, not on the template. gtid_domain_id = server_id is wrong.
Calculate both variables on puppet" [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:03:09] (03PS1) 10Yuvipanda: labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 [15:03:21] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:03:31] (03CR) 10jenkins-bot: [V: 04-1] labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 (owner: 10Yuvipanda) [15:05:01] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 10wikitech.wikimedia.org: Move novaobserver (and novaadmin) users out of ldap - https://phabricator.wikimedia.org/T152215#2854130 (10Krenair) [15:06:21] (03PS2) 10Yuvipanda: labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 [15:07:14] aude, hashar, thanks! [15:07:26] jynus: marosteg1i ^ if you have the time :) [15:08:26] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 3 others: Provide read-only access to OpenStack APIs from WMF IP space - https://phabricator.wikimedia.org/T150092#2854135 (10Andrew) [15:08:28] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 10wikitech.wikimedia.org: Move novaobserver (and novaadmin) users out of ldap - https://phabricator.wikimedia.org/T152215#2854133 (10Andrew) 05Open>03Resolved This should be resolved by https://gerrit.wikimedia.org/r/#/c/325371/ [15:19:25] (03Abandoned) 10Marostegui: mariadb: Enable gtid_domain_id - phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/325781 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:24:43] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854151 (10GWicke) @joe, the data you collected 
in T151702#2841177 was not actually about the original outage p... [15:28:30] (03PS1) 10Yuvipanda: labs: Provision password for labsdbaccounts in labstores [puppet] - 10https://gerrit.wikimedia.org/r/325789 [15:29:40] (03PS3) 10Yuvipanda: labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 [15:29:42] (03PS2) 10Yuvipanda: labs: Provision password for labsdbaccounts in labstores [puppet] - 10https://gerrit.wikimedia.org/r/325789 [15:30:44] (03CR) 10Yuvipanda: [C: 032] labs: Provision password for labsdbaccounts in labstores [puppet] - 10https://gerrit.wikimedia.org/r/325789 (owner: 10Yuvipanda) [15:30:49] (03CR) 10Yuvipanda: [V: 032] labs: Provision password for labsdbaccounts in labstores [puppet] - 10https://gerrit.wikimedia.org/r/325789 (owner: 10Yuvipanda) [15:32:52] (03PS1) 10Jcrespo: mariadb-labspuppet: Where does the mysql password come from? [puppet] - 10https://gerrit.wikimedia.org/r/325791 [15:33:45] (03PS2) 10Jcrespo: mariadb-labspuppet: Where does the mysql password come from? [puppet] - 10https://gerrit.wikimedia.org/r/325791 [15:36:46] (03PS3) 10Jcrespo: mariadb-labspuppet: remove references to a password that is not used [puppet] - 10https://gerrit.wikimedia.org/r/325791 [15:37:20] (03CR) 10Jcrespo: [C: 032 V: 032] mariadb-labspuppet: remove references to a password that is not used [puppet] - 10https://gerrit.wikimedia.org/r/325791 (owner: 10Jcrespo) [15:40:02] (03PS1) 10Andrew Bogott: mariadb: Get labspuppet password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325792 [15:42:12] (03CR) 10Jcrespo: [C: 031] mariadb: Get labspuppet password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325792 (owner: 10Andrew Bogott) [15:42:39] (03CR) 10Andrew Bogott: [C: 032] mariadb: Get labspuppet password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325792 (owner: 10Andrew Bogott) [15:42:51] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:43:11] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2854171 (10Gehel) [15:44:32] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2384405 (10Gehel) Tilerator notification is failing regularly on the maps-test cluster, which it the cluster where hourly updates are enabled. This is correlation, not causality, stil... [15:44:33] ACKNOWLEDGEMENT - HP RAID on db2034 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:2 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152608 [15:44:36] 06Operations, 10ops-codfw: Degraded RAID on db2034 - https://phabricator.wikimedia.org/T152608#2854174 (10ops-monitoring-bot) [15:44:36] (03PS1) 10Jcrespo: Revert "mariadb-labspuppet: remove references to a password that is not used" [puppet] - 10https://gerrit.wikimedia.org/r/325793 [15:44:50] (03CR) 10jenkins-bot: [V: 04-1] Revert "mariadb-labspuppet: remove references to a password that is not used" [puppet] - 10https://gerrit.wikimedia.org/r/325793 (owner: 10Jcrespo) [15:45:14] (03Abandoned) 10Jcrespo: Revert "mariadb-labspuppet: remove references to a password that is not used" [puppet] - 10https://gerrit.wikimedia.org/r/325793 (owner: 10Jcrespo) [15:45:21] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:31] (03PS1) 10Jcrespo: Revert "mariadb: Get labspuppet password from hiera" [puppet] - 10https://gerrit.wikimedia.org/r/325794 [15:45:41] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:45:43] (03CR) 10Jcrespo: [C: 032 V: 032] Revert "mariadb: Get labspuppet password from hiera" [puppet] - 10https://gerrit.wikimedia.org/r/325794 (owner: 10Jcrespo) [15:45:51] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:51] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:01] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:01] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:31] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:41] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:51] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:51] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:01] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:01] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:01] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:01] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:48:21] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:48:25] 06Operations, 10ops-codfw: Degraded RAID on db2034 - https://phabricator.wikimedia.org/T152608#2854179 (10Marostegui) 05Open>03Invalid This can be ignored for now: T149553#2854167 [15:49:01] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:15] (03CR) 10Jcrespo: [C: 031] "I'm ok with this - deploy this now and if in the future we have the stronger authentication methods, we do an alter to accommodate the" [puppet] - 10https://gerrit.wikimedia.org/r/325785 (owner: 10Yuvipanda) [15:52:41] (03PS4) 10Yuvipanda: labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 [15:52:49] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Change backing store's schema to store pw hashes only [puppet] - 10https://gerrit.wikimedia.org/r/325785 (owner: 10Yuvipanda) [15:53:50] ^ YuviPanda do you want me to do the alter, or do you want to? [15:54:01] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:54:19] jynus: I was going to just drop table and redo it, but you can do it too! [15:54:30] that is ok, too :-) [15:54:47] it is more complicated later, with data :-) [15:55:04] jynus: yeah! :D [15:55:12] jynus: if there was any data I'd totally just have you do it :) [15:55:16] thanks for working on that [15:55:23] I'm ok to do it only because it's empty [15:55:30] is there anything I can do to help? [15:55:33] jynus: np! 
thanks for the +1 [15:56:59] (03PS1) 10Elukey: [WIP] Yandex ClickHouse puppetization [puppet] - 10https://gerrit.wikimedia.org/r/325797 [15:57:20] (03PS2) 10Elukey: [WIP] Yandex ClickHouse puppetization [puppet] - 10https://gerrit.wikimedia.org/r/325797 (https://phabricator.wikimedia.org/T150343) [15:58:30] (03Restored) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [15:58:42] (03Restored) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (owner: 10BBlack) [15:59:00] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854212 (10Joe) # Both are "actual outages" # Is it possible that grafana only records successful requests to p... [16:02:11] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[16:02:49] (03PS13) 10BBlack: VCL app_directors 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [16:02:51] (03PS11) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [16:02:53] (03PS10) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [16:02:55] (03PS13) 10BBlack: VCL app_directors refactor 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:02:57] (03PS1) 10BBlack: Varnish: remove "varnish-be-rand" conftool service [puppet] - 10https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717) [16:03:41] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854225 (10Joe) And, to add to what I just said: if this is "normal" given our current concurrency limits, well... [16:05:39] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854245 (10ssastry) @Joe, for every Parsoid parse request, Parsoid could make multiple M/W api requests. So, th... [16:06:11] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[16:07:36] jouncebot: next [16:07:36] In 2 hour(s) and 22 minute(s): Gerrit upgrade to 2.13.3 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T1830) [16:10:30] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854274 (10ssastry) This is just from one server, wtp1001 .. so, like x24 for how many requests Parsoid receive... [16:10:51] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:11:33] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854276 (10ssastry) Oops .. looks like @Joe and I were cranking logs at the same time. :) But, we have two ind... [16:11:41] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:13:55] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854279 (10Papaul) [16:14:41] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:14:51] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:14:51] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:15:01] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:15:02] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:15:39] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854296 (10Papaul) p:05Triage>03Normal [16:15:41] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:15:51] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:16:01] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:16:01] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:16:31] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:16:41] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:16:51] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:17:01] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:17:01] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:17:24] lots of "DatabaseMysqlBase::lock failed to acquire lock 'jobqueue-recycle-refreshLinksPrioritized" [16:19:36] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Yandex ClickHouse puppetization [puppet] - 10https://gerrit.wikimedia.org/r/325797 (https://phabricator.wikimedia.org/T150343) (owner: 10Elukey) [16:19:41] (03CR) 10Alexandros Kosiaris: [C: 031] MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [16:20:23] (03PS2) 10Jcrespo: backups: Fixes to the predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [16:20:39] (03PS3) 10Jcrespo: backups: Fix to the predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [16:20:41] (03PS1) 10Andrew Bogott: Make labspuppetbackend_mysql_password a hiera global [puppet] - 10https://gerrit.wikimedia.org/r/325800 [16:21:37] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854315 (10ssastry) But, one thing that my paste above shows that there were no retries from RB to Parsoid on w... 
[16:22:22] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854316 (10GWicke) @ssastry's data seems to roughly concur with grafana's. ~30 req/minute per node works out to... [16:22:42] (03PS2) 10Andrew Bogott: Make labspuppetbackend_mysql_password a hiera global [puppet] - 10https://gerrit.wikimedia.org/r/325800 [16:23:04] (03PS2) 10Dzahn: zuul: Fix puppet URL in comment correctly [puppet] - 10https://gerrit.wikimedia.org/r/325629 (owner: 10Tim Landscheidt) [16:24:14] (03CR) 10Dzahn: [C: 032] zuul: Fix puppet URL in comment correctly [puppet] - 10https://gerrit.wikimedia.org/r/325629 (owner: 10Tim Landscheidt) [16:24:22] (03PS4) 10Elukey: MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [16:24:24] (03CR) 10Andrew Bogott: [C: 032] Make labspuppetbackend_mysql_password a hiera global [puppet] - 10https://gerrit.wikimedia.org/r/325800 (owner: 10Andrew Bogott) [16:24:28] (03PS3) 10Andrew Bogott: Make labspuppetbackend_mysql_password a hiera global [puppet] - 10https://gerrit.wikimedia.org/r/325800 [16:24:52] (03PS3) 10Dzahn: udp2log: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325483 (owner: 10Tim Landscheidt) [16:26:22] (03CR) 10Alexandros Kosiaris: [C: 031] Make labspuppetbackend_mysql_password a hiera global [puppet] - 10https://gerrit.wikimedia.org/r/325800 (owner: 10Andrew Bogott) [16:32:38] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854352 (10Joe) @Gwicke it's more like 10 times that number (300 req/min/node) and about 120 reqs/s/node. And @... 
[16:34:53] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854359 (10Joe) @Gwicke to be very clear, it's the shower of requests during those few minutes that causes the... [16:36:21] PROBLEM - Host ms-be2025 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:28] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854362 (10Papaul) [16:39:38] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854369 (10ssastry) I think we are all on the same page now. I agree that it is that initial burst of template... [16:39:40] !log removing bits.w.o VHost from mediawiki apache config (https://gerrit.wikimedia.org/r/#/c/305536) [16:39:41] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:43] !log disabled puppet on mw1* hosts as prep step [16:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:22] (03CR) 10Dzahn: [C: 032] udp2log: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325483 (owner: 10Tim Landscheidt) [16:42:28] (03PS4) 10Dzahn: udp2log: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325483 (owner: 10Tim Landscheidt) [16:42:30] (03CR) 10Elukey: [C: 032] MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [16:42:33] (03PS5) 10Elukey: MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [16:42:41] 
PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:01] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:01] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:34] (03CR) 10Elukey: [V: 032] MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [16:45:01] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:01] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:41] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:46:27] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854377 (10Joe) I think what would be effective mitigating the issue would be if changepropagation spread out t... [16:47:00] !log running puppet on some mw codfw appservers to check the new config [16:47:01] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:47:02] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:47:02] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:20] marostegui: --^ [16:47:24] is it normal? 
[16:48:31] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:41] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:01] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:03] they seem to complain for a missing hiera key --^ [16:49:31] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:50:05] grab the puppet-umbrella [16:50:32] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:50:41] RECOVERY - Host ms-be2025 is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [16:50:41] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:50:41] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:01] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:02] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:31] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:16] jynus: --^ [16:53:31] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:53:31] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:53:32] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:01] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854401 (10Joe) Also, note that we could even do as follows: leave all servers in the API cluster active in bot... [16:54:01] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:01] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:35] <_joe_> elukey: which hiera key? [16:54:39] !log force puppet run on mw2* hosts (10% batch-size) [16:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:54] _joe_ checking, I don't remember [16:55:26] labspuppetbackend_mysql_password [16:55:32] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:34] <_joe_> wat? [16:55:45] /etc/puppet/modules/role/manifests/mariadb.pp [16:56:01] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:56:20] andrewbogott ^ [16:56:31] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:56:31] <_joe_> ok it's missing in the private hiera data for codfw [16:56:52] only he knows which is the right password [16:57:01] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:20] I can revert, again? 
[16:57:21] <_joe_> jynus: grep can tell you as well, I guess [16:57:31] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:39] <_joe_> but I'd wait for andrew, yes [16:58:01] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:01] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:01] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:41] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:47] I'll fix it, one second... [16:59:01] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:07] (03PS4) 10Jcrespo: backups: Fix to the predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [16:59:31] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:01] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:38] _joe_, jynus, fixed. [17:00:42] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:01:00] thanks andrewbogott [17:01:01] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:01] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:01:11] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:25] <_joe_> thanks [17:02:01] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:01] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:21] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:01] RECOVERY - puppet last run on es2019 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:04:14] (still running puppet on mw2* hosts) [17:05:10] (03PS5) 10Jcrespo: backups: Fix to the predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [17:05:39] 06Operations, 10ops-codfw, 06DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2854433 (10Papaul) - RAID Controller and battery replacement complete. - Clean all logs Leaving this task open for now [17:06:31] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:43] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2854441 (10mehtab.ahmed) Issue not resolved yet! [17:08:38] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2854458 (10Aklapper) >>! In T138136#2854441, @mehtab.ahmed wrote: > Issue not resolved yet! That is why you see "Open" in the upper left corner below the... 
[17:08:41] !log Apache config changed on mw2*, tests look fine (apachectl -S does not show the vhost, apachectl -t is ok, apache-fast-test from tin is ok). Proceeding with eqiad [17:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:12] (03PS5) 10Dzahn: udp2log: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325483 (owner: 10Tim Landscheidt) [17:10:52] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854468 (10fgiunchedi) [17:11:15] proceeding with mw1* [17:11:39] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854279 (10fgiunchedi) @Papaul looks good, I've edited the hostnames. Please expose all disks to the OS, it'll be software raid-ed [17:11:41] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:12:01] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:12:01] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:12:48] (03PS1) 10Urbanecm: Enable SandboxLink at sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325803 (https://phabricator.wikimedia.org/T152609) [17:13:01] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:13:01] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:13:21] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2854478 (10mehtab.ahmed) Can't we get another free license font. 
[17:14:01] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:14:01] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:14:52] (03CR) 10Filippo Giunchedi: [C: 031] Initial debianization [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/325568 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [17:15:01] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:16:01] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:16:31] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:17:32] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:17:41] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:18:31] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:19:41] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:19:41] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:20:01] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:20:01] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:20:31] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:20:59] (03PS2) 10Urbanecm: Enable SandboxLink at sdwiki and sdwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325803 
(https://phabricator.wikimedia.org/T152609) [17:21:31] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:21:40] 06Operations, 10ops-codfw: codfw: rack/setup 4 swift frontend - https://phabricator.wikimedia.org/T152612#2854521 (10fgiunchedi) [17:22:01] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:22:20] (03PS6) 10Jcrespo: backups: Fix to the predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [17:22:32] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:22:32] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:22:38] (03PS1) 10Chad: Moving group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325804 [17:22:52] (03CR) 10Chad: [C: 04-2] "For later." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/325804 (owner: 10Chad) [17:23:01] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:23:31] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:24:01] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:24:02] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2854524 (10Papaul) [17:25:09] !log puppet run completed on mw1* hosts (10% batch-size) [17:25:18] (03PS7) 10Jcrespo: backups: Fix & uniform predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [17:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:22] all right apache config change completed [17:25:28] (03PS8) 10Jcrespo: backups: Fix & uniform predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) [17:25:32] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:25:32] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:25:34] 06Operations, 06Performance-Team, 10Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#2854533 (10Gilles) a:03Gilles [17:25:41] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:25:44] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854534 (10ssastry) >>! 
In T152074#2854401, @Joe wrote: > Also, note that we could even do as follows: leave al... [17:26:01] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:26:01] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:26:01] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:26:01] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:26:11] jouncebot: next [17:26:11] In 1 hour(s) and 3 minute(s): Gerrit upgrade to 2.13.3 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T1830) [17:26:52] (03PS1) 10Andrew Bogott: Nova policy: Make more read-only activities globally available [puppet] - 10https://gerrit.wikimedia.org/r/325806 [17:28:01] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:28:01] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:28:31] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:28:54] (03CR) 10Jcrespo: [C: 032] backups: Fix & uniform predump and bpipe mysql method of backups [puppet] - 10https://gerrit.wikimedia.org/r/325759 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [17:29:01] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:30:01] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:30:01] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:30:11] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last 
run 30 seconds ago with 0 failures [17:30:21] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:30:48] 06Operations, 10MediaWiki-ResourceLoader, 06Performance-Team, 10Traffic: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#2854573 (10Krinkle) a:03Krinkle [17:31:12] (03CR) 10Gehel: lvs: add logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [17:33:56] !log forced retry of dbstore1001 backups [17:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:31] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:51] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:40:11] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [17:41:17] (03PS1) 10Jcrespo: backups: set --defaults-extra-file as the first parameter [puppet] - 10https://gerrit.wikimedia.org/r/325808 [17:41:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [17:41:52] there is a large spike of 5XX [17:42:11] it is going down, I think [17:42:11] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:42:59] (03CR) 10Jgreen: [C: 032] add donatetowikipedia.[com|org] as parked domain [dns] - 10https://gerrit.wikimedia.org/r/325706 (owner: 10Dzahn) [17:43:20] (03CR) 10Jgreen: [C: 031 V: 031] add donatetowikipedia.[com|org] as parked domain [dns] - 10https://gerrit.wikimedia.org/r/325706 (owner: 10Dzahn) [17:43:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: 
replication_delay is 645 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4765386 keys, up 37 days 9 hours - replication_delay is 645 [17:43:46] (03CR) 10Jcrespo: [C: 032] backups: set --defaults-extra-file as the first parameter [puppet] - 10https://gerrit.wikimedia.org/r/325808 (owner: 10Jcrespo) [17:45:11] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [1000.0] [17:45:43] jynus: what the hell [17:49:33] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854710 (10GWicke) >>! In T152074#2854352, @Joe wrote: > @Gwicke it's more like 10 times that number (300 req/m... [17:52:11] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:52:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:54:29] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2837406 (10Pchelolo) > Could you point out where either the logs or grafana show a spike in request rates from... [17:54:41] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:01:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:01:40] (03PS1) 10Jcrespo: backups: Eliminate quotes on mysql/mysqldump execution [puppet] - 10https://gerrit.wikimedia.org/r/325810 [18:02:51] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:02:56] (03CR) 10Jcrespo: [C: 032] backups: Eliminate quotes on mysql/mysqldump execution [puppet] - 10https://gerrit.wikimedia.org/r/325810 (owner: 10Jcrespo) [18:03:11] !log add zareen to nda LDAP group, per https://phabricator.wikimedia.org/T149211 [18:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:42] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4754174 keys, up 37 days 9 hours - replication_delay is 0 [18:08:47] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2854773 (10jcrespo) We finally have the backups again up and running, with one day of delay. Reminder: check that all complete ok. [18:10:20] !log will be bouncing some main-eqiad kafka brokers to try to troubleshoot T142430 [18:11:48] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:07] T142430: Ensure no dropped messages in eventlogging producers when stopping broker - https://phabricator.wikimedia.org/T142430 [18:13:15] 06Operations, 10DBA, 10Monitoring: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2854794 (10jcrespo) So, the main issue is that we can have problems like this: ``` 07-Dec 02:05 helium.eqiad.wmnet JobId 4293... 
[18:14:01] 06Operations, 10DBA, 10Monitoring: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2854796 (10jcrespo) a:05jcrespo>03None [18:16:27] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2854801 (10GWicke) The CP processing graph for the second outage referenced by @joe agrees with the RB and Pars... [18:17:10] Dereckson: there's a bunch of server-side uploads pending, I wonder if we should reserve some time to do them [18:17:27] hi mafk [18:17:42] I can handle them this evening if you wish. [18:17:43] bonsoir [18:18:13] Let me some time to eat a little bit, then I'll prepare these uploads. [18:18:30] Well, I have no access to terbium so it needs to be done by someone who has and I've found a comment of yours saying that files uploaded to v2c2 expire after few days [18:18:55] maybe we should have Herald categorize them as high priority cough cough [18:23:26] (03PS1) 10Ottomata: Increase sync_timeout=10.0 for eventbus service kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/325813 (https://phabricator.wikimedia.org/T142430) [18:23:53] mafk: zhuyifei1999_ raised the duration [18:24:10] ? 
[18:24:14] good to know [18:24:46] oh v2c [18:27:49] (03PS2) 10Ottomata: Increase sync_timeout=10.0 for eventbus service kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/325813 (https://phabricator.wikimedia.org/T142430) [18:29:16] (03CR) 10Ottomata: [C: 032] Increase sync_timeout=10.0 for eventbus service kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/325813 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [18:30:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:30:04] ostriches and godog: Respected human, time to deploy Gerrit upgrade to 2.13.3 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T1830). Please do the needful. [18:30:21] oohz thiz gerrit upgradez [18:32:08] godog: I'm all ready when you are :) [18:32:10] ostriches: uploading the package now [18:32:15] Ok [18:35:14] !log upload gerrit 2.13.3-wmf.1 to carbon [18:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:28] 06Operations, 06Performance-Team, 10Traffic: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2854865 (10Gilles) p:05Normal>03High a:03Gilles [18:35:28] ostriches: are you root on cobalt or shall I ? [18:35:34] I'm root [18:36:00] ostriches: ok! let me know how I can help [18:36:09] Mainly if it all goes sideways :) [18:36:23] heheh ok [18:37:26] Ah, whoops, got a puppet patch I forgot [18:38:11] We'll land after upgrade, easier [18:38:51] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:41:43] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2854878 (10Cmjohnson) @akosiaris I am sorry this got buried. Should we schedule a time? 
[18:43:07] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2854880 (10RobH) a:03kaldari This, being a sudo/root level request, will require review in the operations meeting. Before that happens though, it would be nice to have all the other requ... [18:44:28] Dangit, full reindex? [18:44:28] kaldari: ^ rights expansion and i think you are his manager? [18:44:30] Stupid gerrit.... [18:44:45] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2854882 (10akosiaris) @Cmjohnson Sure! Got any requested timeslot ? [18:44:51] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2851625 (10Krenair) This is a request for root in labs instances, which is not normally handled in this way. This doesn't build on his production shell rights. [18:45:31] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [18:45:51] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:45:56] um [18:46:01] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [18:46:10] oh, port 29418 [18:46:10] ok [18:46:34] reindexing with a metric shitton of threads... [18:47:46] online reindexing my ass. [18:48:01] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:48:27] And this is why I give myself an hour.... [18:48:33] ^^ that's weird [18:48:38] i just restarted the service, seemed to be fine [18:48:41] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:48:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [18:49:03] oh that's why ok ok [18:49:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [18:50:01] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [18:50:02] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment] [18:50:34] Dammit. [18:50:41] I hate you gerrit. [18:51:52] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [18:51:52] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 505.92 seconds [18:51:52] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [18:52:21] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:52:46] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2854899 (10akosiaris) Scheduled for Friday Dec 9th late US morning. I 'll be depooling+shutting down the hosts a bit before [18:52:52] Protip: we should make the git pull in puppet to not try it so often.... [18:52:54] uh I guess slave lag on m3 is gerrit also, checking [18:53:01] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 2 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_refinery_source],Exec[git_pull_analytics/discovery-stats],Exec[git_pull_aggregator_code],Exec[git_pull_analytics/reportupdater] [18:54:51] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [18:55:15] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2854900 (10Cmjohnson) H/W Raid is set up ssds raid 1 spinning disks raid 10 [18:55:51] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:56:02] reindex 2/3 done [18:57:05] 3/4 [18:57:06] ostriches the logger reindex is probaly all the accounts [18:57:17] the accounts reindexed fast. [18:57:23] It wants to reindex all changes too [18:57:24] Oh [18:57:40] "Online reindexing" is a huge lie :p [18:58:08] Oh are you trying the online reindex? 
[18:58:18] I think that only works if you're not upgrading gerrit [18:58:24] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.91 seconds Jcrespo https://phabricator.wikimedia.org/T151039 - The acknowledgement expires at: 2016-12-09 10:00:00. [18:58:27] No, because it wouldn't let me start until I did a full reindex. [18:58:38] yep [18:58:41] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:58:51] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:59:10] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2854920 (10bd808) I have had [[https://wikitech.wikimedia.org/w/index.php?title=Special%3AUserRights&user=BryanDavis|cloudadmin since 2014-09-25T17:09:06]] which functionally gives me root... [18:59:43] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2854923 (10Legoktm) I'm aiming to deploy this everywhere tomorrow (that way T152242 will hav... [18:59:52] ostriches we can add you to capability.administrateServer if you want [19:00:02] I already have that. [19:00:05] Oh [19:00:07] * Krenair facepalm [19:00:08] What good would that do? [19:00:28] in case someone hacks in and removes admin status or any rights from all users [19:00:39] That doesn't even make sense. [19:00:41] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:00:52] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [19:00:53] paladox: Chad is fine, but please don't go around offering administrator rights, especially as you are not a privileged user in gerrit yourself [19:00:59] If someone hacks gerrit we have bigger problems ;-) [19:01:42] jouncebot: next [19:01:42] In 0 hour(s) and 58 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T2000) [19:02:24] Krenair as chad is the maintainer of gerrit, i was asking if he should be added to that, i do not go around saying do you want to be an admin. I already know he is one. [19:02:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:03:02] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:03:31] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:03:59] ottomata: mirrormaker is you ? ^ [19:04:10] yeah [19:04:13] on it sorry [19:04:37] no worries, making sure [19:04:41] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [19:04:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:05:01] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:05:02] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [19:05:49] mutante: do you have a link handy to that bug where I said I'd refactor the eventloggign role classes? [19:05:51] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:06:05] Of course it gets slower the closer it gets to completing. [19:06:08] Fuck. This. [19:06:19] * TabbyCat eyerolls [19:07:57] Reindexing changes: projects: 99% (1743/1745), 94% (302429/320913) (/) [19:08:01] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [19:08:07] Reindexing changes: projects: 99% (1743/1745), 94% (302588/320913) (\) [19:08:19] * ostriches sighs [19:09:08] what? [19:09:37] what is the reindexing thing? [19:10:11] apergos it's for accounts and projects. [19:10:16] + more i think. [19:10:34] apergos: Gerrit being gerrit [19:10:55] software sucks syndrome? [19:11:02] sympathies... [19:11:21] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [19:11:24] 06Operations, 10Deployment-Systems, 06Performance-Team, 06Release-Engineering-Team, 07HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#2854985 (10Gilles) p:05Normal>03Triage [19:11:46] apergos: Gerrit supports online reindexing according to docs. [19:11:50] I'm convinced this is a complete lie [19:11:54] hahahaha [19:11:59] As every time I upgrade, I have to do a full offline reindex. [19:12:15] * ostriches shouts all kinds of dirty things at gerrit [19:12:35] http://stackoverflow.com/questions/31322148/online-reindexing-in-gerrit-2-11 [19:12:41] ostriches it's an ssh command. [19:13:08] That's not what I'm talking about [19:13:12] Just be quiet, you're not helping [19:13:45] Ok, sorry. [19:14:37] paladox: Sorry if I sound harsh, I'm just in a bad mood because of this and the extra commentary (from everyone) doesn't help much :) [19:14:50] Ok :) [19:14:51] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:18:18] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2855025 (10kaldari) I approve! [19:18:32] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2855026 (10kaldari) a:05kaldari>03None [19:27:42] ottomata: https://phabricator.wikimedia.org/T93645 should work, thanks! 
[19:27:51] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:28:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4752661 keys, up 37 days 11 hours - replication_delay is 34 [19:30:51] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:32:10] thanks mutante [19:32:19] mutante: confused about https://phabricator.wikimedia.org/T152081 then [19:32:37] i'd look at that gerrit patch to see buuuut [19:32:42] mariadb: Move eventlogging class to a single file ? [19:32:45] ah ok, not eventlogging role [19:33:18] ostriches: I'm guessing reindexing still not finished? [19:33:24] "Almost" [19:33:29] Reindexing changes: projects: 99% (1743/1745), 97% (312043/320913) (/) [19:33:37] But, it gets slower and slower as it gets closer to finishing [19:33:41] * ostriches shrugs [19:33:45] gghhhnnn [19:33:46] I need to figure out a way around this next time [19:34:04] Or, you know, stop using gerrit ;-) [19:34:06] 06Operations, 07Puppet, 10Analytics: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2855084 (10Ottomata) [19:34:46] hehehe talk about gnashing of teeth [19:34:54] ottomata: i would say 152081 is a subtask of the one i linked [19:34:59] either works [19:35:30] godog: I could've migrated all 1745 repositories to Phab in less time by now ;-) [19:35:38] actually, i am not sure what the intention is in 152081 [19:36:25] will have to look at gerrit later [19:36:27] * awight eyes ostriches warily [19:36:40] I've been meaning to ask here--is there a plan to migrate to diffusion?
[19:36:48] I guess they are finally working on the upstream bug to [19:36:59] awight: no firm plans right now [19:37:06] cos fr-tech piloted some code review in that and we're hardline opposed to ever using it again [19:37:21] mutante: yeah, but i didn't realize it was about the mariadb eventlogging stuff specifically [19:37:24] i renamed it [19:37:25] not that we'll try to veto or anything, but I respectfully request that we be looped in when the discussion takes off [19:37:27] i just made a new subtask too [19:37:30] thx! [19:37:40] to use differential without arc [19:37:44] ostriches: is there a task we can monitor? [19:37:45] which is pretty great [19:37:56] awight: another time (not now) I'd love to talk about that [19:37:59] apergos: ah interesting. yeah those Facebook utilities terrify me [19:38:00] 06Operations, 06Performance-Team: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473#2855134 (10Gilles) p:05Triage>03High [19:38:01] got it [19:38:03] https://secure.phabricator.com/T5000 [19:38:25] 06Operations, 06Performance-Team: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473#2849567 (10Gilles) p:05High>03Normal [19:38:31] it's not a fb utility. fb doesn't own/manage/write phab anymore :) [19:38:42] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2855150 (10Eevans) [19:38:46] apergos: thanks! Seems like that should be linked to a bigger task about using diffusion?
[19:39:04] uh [19:39:07] ottomata: the part that matters for my ticket is that modules/role/manifests/eventlogging.pp gets split into "one class per file" [19:39:19] I don't know what their task setup and boards are like, awight [19:39:22] yup [19:39:29] mutante: just made https://phabricator.wikimedia.org/T152621 [19:39:34] apergos: ooh sorry, I mistook that for WMF's [19:39:46] ok [19:39:54] differential lets you upload git-diffs [19:40:03] ottomata: cool! thanks. also just saw there are some TODO comments in the eventlogging.pp file itself [19:40:39] * TabbyCat dinner [19:41:06] k, there's a great wiki page about the migration. https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Project/Differential_Migration [19:42:37] awight https://phabricator.wikimedia.org/T127 [19:42:52] Ewww, diffusion [19:43:09] ewww, gerrit [19:43:10] so much git logspam [19:43:26] gerrit is the absolute worst kind of person :p [19:43:34] hehe, beautiful choices we've got here [19:43:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:43:46] I vote for going back to code review over e-mail tbh :p [19:44:01] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:44:06] awight: the timeline on that page is out of date [19:44:09] if it's good enough for the kernel it's good enough for me [19:44:21] ostriches: I think there are actually more than a couple people who would like that [19:44:25] https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Project/Planning is the current RelEng team planning page [19:44:43] bawolff: Those were the days.... 
[19:44:49] before my time [19:44:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:45:11] PROBLEM - Check systemd state on kafka1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:45:19] me ^ [19:45:20] once upon a time brion did all code review by himself each monday and merged stuff by hand. those were the good old days. for some value of good [19:45:39] Good for people whose name != Brion, I suppose [19:45:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:45:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [19:45:55] bawolff: Maybe we *should* have bolted git support into Special:CodeReview :) [19:46:01] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [19:46:08] hahaha [19:46:11] RECOVERY - Check systemd state on kafka1013 is OK: OK - running: The system is fully operational [19:47:23] I have a feeling this won't finish in the next 13m and my train will be delayed [19:47:30] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2855176 (10Urbanecm) [19:47:33] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#2855175 (10Urbanecm) [19:47:44] Some Mussolini I am, can't even keep the trains on time.... 
[19:47:59] * apergos shudders [19:48:02] Why there are two wikibugs bots? [19:48:09] I think one will do... [19:48:11] paladox: ty! [19:48:16] Urbanecm: Twice the fun? [19:48:19] your welcome [19:49:01] I SWEAR TO ALL THAT IS GOOD AND HOLY IN THIS WORLD I AM GOING TO STAB SOMETHING [19:49:04] GAHHHHHHHHHHHH [19:49:12] LUCENE WAS THE DUMBEST IDEA EVER FOR GERRIT [19:49:22] I MEAN SERIOUSLY [19:49:34] FUCK. [19:49:54] (AND NO I'M NOT TURNING OFF CAPSLOCK, I SHALL CONTINUE TO YELL) [19:50:05] I have on my irc earmuffs, proceed [19:50:21] bring in figlet [19:50:31] how long was the last re-indexing, dare I ask? [19:50:51] ABOUT AS LONG, BUT THE DOCS LIED AND SAID THERE WAS NO OFFLINE REINDEXING NEEDED FOR 2.13.X [19:51:00] NOR DID MY TEST NEED ONE [19:51:14] naturally [19:51:31] ostriches, if this is wanted state... [19:51:55] Urbanecm: I have no idea re: wikibugs bot, I don't run it. I'm just being snarky (see my CAPSLOCK RAGE) [19:52:03] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2855179 (10RobH) a:03yuvipanda >>! In T152520#2854883, @Krenair wrote: > This is a request for root in labs instances, which is not normally handled in this way. This doesn't build on his... [19:53:55] just checking, feel free to answer in capslocks, any eta for gerrit happy again? :) (just planning the rest of afternoon work) [19:55:21] ottomata: When it's done. [19:55:32] haha oook [19:55:50] lol [19:55:53] 98%, but considering the way the timing goes, 2% is about 20% of the time :( [19:56:18] ostriches: in case it tempers my poorly timed vitriol about differential ;), I also fucking loathe gerrit with the best of 'em [19:56:29] I'm so sorry that you keep finding yourself in its bowels [19:56:43] * ostriches puts on his rubber boots [19:56:50] bowels are messy! 
[19:56:55] ostriches (sorry if this looks like spam, but about your comment earlier about reviews through email, this https://gerrit-review.googlesource.com/#/c/89303/1/sessions/email-ingestion.md looks something like what you want?) [19:57:00] hip-waders [19:57:05] snake stick and a flashlight [19:57:37] most likely will be added in 2.14. [19:57:47] hold my gerrit, I'm going in [19:58:05] godog: Hold my beer, watch this! [19:58:06] :p [19:58:09] what will i do without gerrit!! [19:58:18] ebernhardson: Have some fun in your life? [19:58:20] :P [19:58:27] ostriches: hahaha nonononono.gif [19:58:51] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:59:02] Is there anybody who thing this is serious? Or maybe we should change the topic, it says "serious stuff" currently :D [19:59:06] *think [19:59:21] i'm super duper srs [19:59:26] this is srs bizniss [20:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T2000). [20:00:09] jouncebot: go away [20:00:56] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2855190 (10yuvipanda) Closest prior is when we gave root to the volunteer tools admins - since they already had root on tools, and nobody in the labs team objected, we just added their root... [20:01:06] https://gerrit-review.googlesource.com/#/c/92435/ [20:01:49] jouncebot: hug ostriches.
his day isn't going according to plan [20:02:38] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.5 [20:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:24] 06Operations, 10Ops-Access-Requests: Requesting access to Labs Root for bd808 - https://phabricator.wikimedia.org/T152520#2855195 (10yuvipanda) @bd808 can you make a patch to labs/private with a new key (separate from your non-root labs key and your prod key)? [20:03:50] !log demon@tin Started scap: symlink update [20:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:25] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2855201 (10Aklapper) @mehtab.ahmed : Find one and propose one? [20:13:02] PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:14:01] RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:16:04] !log update prometheus-node-exporter in ulsfo/esams - T152580 [20:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:18] T152580: rollout prometheus-node-exporter 0.13 - https://phabricator.wikimedia.org/T152580 [20:21:03] 99% [20:21:16] But I got 1 problem, that last % :p [20:22:19] Occupy Gerrit [20:24:52] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 2 minutes ago with 14 failures. Failed resources (up to 3 shown): Package[debian-goodies],Package[apt-listchanges],Package[ethtool],Package[tshark] [20:25:46] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#2855117 (10Krenair) There are two serious technical ways to fix this. There may be policy reasons why not to do one or both of these 1) Tran... 
[20:25:50] zeno's paradox gone wrong [20:26:13] !log update prometheus-node-exporter in codfw - T152580 [20:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:28] T152580: rollout prometheus-node-exporter 0.13 - https://phabricator.wikimedia.org/T152580 [20:31:12] * yurik gives ostriches a big support cookie [20:31:22] nom nom nom [20:31:36] ostriches, you're running the train while gerrit is down? [20:31:37] cookie crashes [20:32:07] cookie has caused an unexpected digestive error, ... [20:33:29] !log update prometheus-node-exporter in eqiad - T152580 [20:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:44] T152580: rollout prometheus-node-exporter 0.13 - https://phabricator.wikimedia.org/T152580 [20:34:51] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:37:40] 06Operations, 06Performance-Team: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473#2855273 (10fgiunchedi) Grafana on labmon updated to 4.0.1, no issues reported so far. When further testing (alerting?) is completed I think we can upgrade production too. [20:38:40] 06Operations, 07Puppet, 06Analytics-Kanban: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2855278 (10Ottomata) [20:43:03] 320615/320913 [20:43:17] Krenair: Sure why not :) [20:43:23] ok... [20:43:34] Who needs gerrit for the train? :) [20:44:00] s/for the train// [20:44:02] hehehe [20:44:05] hahaha [20:44:35] Awaiting gerrit. 
[20:46:24] !log gerrit back [20:46:31] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.3 (SSHD-CORE-1.2.0) (protocol 2.0) [20:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:51] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [20:46:56] woot [20:46:56] There it is. Thanks! [20:47:01] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:47:20] joed_: You said the magic words I guess :) [20:48:01] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:48:06] (03CR) 10Chad: [V: 040 C: 032] Moving group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325804 (owner: 10Chad) [20:48:11] V: 0? :o [20:48:11] (03CR) 10jenkins-bot: [V: 040 C: 040] Moving group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325804 (owner: 10Chad) [20:48:17] All it needs to know is that I'm waiting on it. 
[20:48:21] that's new [20:48:51] (03Merged) 10jenkins-bot: Moving group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325804 (owner: 10Chad) [20:49:01] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:49:13] (03PS4) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) [20:50:11] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:50:21] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:51:02] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:51:12] !log demon@tin Finished scap: symlink update (duration: 47m 21s) [20:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:20] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2855321 (10Papaul) [20:52:52] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:53:44] (03PS2) 10Filippo Giunchedi: base: get rid of monthly ieee-data cronjob [puppet] - 10https://gerrit.wikimedia.org/r/325699 (https://phabricator.wikimedia.org/T152440) [20:54:02] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:54:06] (03PS2) 10Dzahn: add donatetowikipedia.[com|org] as parked domain [dns] - 10https://gerrit.wikimedia.org/r/325706 [20:54:11] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:54:51] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:55:51] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:56:41] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:56:43] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2855348 (10leila) [20:57:59] wow, gerrit works so fast now :) [20:58:11] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:58:18] i never appreciated it i guess until it was down :) [20:58:22] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2855353 (10leila) @Cmjohnson you helped us recently with T142780 . I added Ops-Access-Requests to this task but I'm not sure if I can/should do that. Ple... [20:59:00] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2855355 (10leila) [20:59:11] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:59:44] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2855359 (10leila) @Cmjohnson same comment as in T152023#2855353. [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161207T2100). Please do the needful. 
[21:01:19] (03CR) 10Dzahn: [V: 040 C: 032] add donatetowikipedia.[com|org] as parked domain [dns] - 10https://gerrit.wikimedia.org/r/325706 (owner: 10Dzahn) [21:01:31] 06Operations, 10ops-codfw, 10netops: ms-fe200[5-8] switch port configuration - https://phabricator.wikimedia.org/T152627#2855380 (10Papaul) [21:01:41] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:02:00] (03PS1) 10BryanDavis: Add labs root key for bd808 [labs/private] - 10https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) [21:02:25] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2835737 (10Peachey88) >>! In T152023#2855353, @leila wrote: > @Cmjohnson you helped us recently with T142780 . I added Ops-Access-Requests to this task b... [21:02:27] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2854279 (10Papaul) [21:02:41] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:03:01] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:03:47] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2855403 (10leila) Great. Thanks for confirming @Peachey88 . 
[21:03:52] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:04:11] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [21:06:02] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:06:03] * legoktm hugs ostriches <3 [21:08:21] !log starting Parsoid deploy [21:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:16] (03PS3) 10Dzahn: base: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325468 (owner: 10Tim Landscheidt) [21:09:21] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:09:43] !log arlolra@tin Starting deploy [parsoid/deploy@a77f72a]: (no message) [21:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:02] (03CR) 10Dzahn: [V: 040 C: 032] base: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325468 (owner: 10Tim Landscheidt) [21:11:31] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:11:41] interesting, the red 0 [21:11:49] but it had V+2 just not from me [21:11:57] as it should [21:12:09] hrm [21:12:40] that's a bit annoying [21:12:58] should be the other way around for this repo [21:13:05] red if human did it [21:13:37] hehe same for code review colors, adding code should be red [21:13:55] (03PS3) 10Dzahn: scap: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325501 (owner: 10Tim Landscheidt) [21:14:01] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:15:18] (03PS1) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:15:25] mutante ostriches ^^ :) [21:15:48] (03PS2) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:15:54] (03CR) 10Chad: [V: 040 C: 031] "This should go out now." [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [21:16:00] More important ^ [21:16:02] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:16:14] Like I said, I don't care about pretty right now [21:16:26] (03CR) 10Paladox: [V: 040 C: 031] "I've tested jdk 8 and it works. 
Lets switch :)" [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [21:16:46] lol it sticks with your code review too [21:17:00] (03PS2) 10Andrew Bogott: Nova policy: Make more read-only activities globally available [puppet] - 10https://gerrit.wikimedia.org/r/325806 [21:17:02] (03PS1) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325828 (https://phabricator.wikimedia.org/T150092) [21:17:15] (03PS2) 10Dzahn: Gerrit: Swap to using openjdk8 [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [21:17:24] (03CR) 10Chad: [V: 040 C: 031] "We're already using it, of course it works. This makes it permanent." [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [21:17:36] ah the restart [21:17:38] of course [21:17:43] apergos: No restart needed. [21:17:47] It's already using that. [21:17:55] well grrt-wm left for some reason or other [21:18:02] Oh, meh [21:18:10] I thought you meant something important ;-) [21:18:16] no no [21:18:38] (03CR) 10Dzahn: [V: 032 C: 032] Gerrit: Swap to using openjdk8 [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [21:18:43] I'm all out of important for the day [21:19:18] ostriches: there, i'll let you handle puppet on cobalt [21:19:27] tyvm [21:19:35] apergos that sounds like it is crashing [21:19:36] apergos: I'm out of fucks to give :p [21:19:38] and the bot also should not need a restart [21:19:39] * paladox looks up the logs [21:20:01] mutante nope, it could be crashing due to a difference in gerrit 2.13 [21:20:02] (03CR) 10jenkins-bot: [V: 04-1] Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325828 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:20:06] or just crashing for no reason [21:20:39] ostriches does this update mean we can try jgit gc again [21:20:40] ? [21:20:51] Nope, I don't trust it [21:20:54] Never again [21:20:56] Ever. 
[21:20:58] Until the day I die [21:21:53] ok [21:21:57] !log arlolra@tin Finished deploy [parsoid/deploy@a77f72a]: (no message) (duration: 12m 13s) [21:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:11] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki] [21:23:15] Is the gerrit update done, or still in progress? [21:23:29] andrewbogott: {{done}} afaik [21:23:36] gerrit looks very 503 to me now [21:23:37] ostriches im getting a 500 Internal server error [21:23:40] ok. I had an issue but it seems to have fixed itself somehow [21:23:45] on https://gerrit.wikimedia.org/r/#/c/324972/ [21:23:48] in related changes [21:23:51] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:23:55] I'm not [21:24:00] Oh [21:24:02] Hmm [21:24:20] Eh, works for other changes. [21:24:20] worked couple of mins ago [21:24:26] (03PS6) 10Dzahn: udp2log: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325483 (owner: 10Tim Landscheidt) [21:24:46] _joe_: re https://gerrit.wikimedia.org/r/#/c/325046/ -- a separate module puppet_enc might make more sense as well. Any opinions? [21:24:46] ostriches https://phabricator.wikimedia.org/F5006930 [21:24:47] * mutante uses it and seems fine [21:24:47] andrewbogott, SMalyshev: had one last restart as puppet got re-enabled [21:24:57] paladox: Meh [21:25:01] (03CR) 10jenkins-bot: [V: 04-1] Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:25:04] I'm out of energy to care. [21:25:07] although it is also connected to the puppetmaster, hm. [21:25:10] ostriches: ok, will wait 5 mins and retry [21:25:11] happens in edge so it is not a bug in ie. 
[21:25:19] SMalyshev: Nah should be back already [21:25:37] ostriches: https://gerrit.wikimedia.org/r/#/c/325821/ is still 503 for me [21:25:50] wfm [21:25:54] loads for me also [21:26:21] !log mholloway-shell@tin Starting deploy [mobileapps/deploy@adc5f07]: Update mobileapps to 2b1d206 [21:26:22] [2016-12-07 21:24:15,940] ERROR com.google.gerrit.httpd.restapi.RestApiServlet : Error in GET /r/changes/324972/revisions/1049322a1642c336b5cc660c62ec2fce0b67bdb1/related [21:26:22] java.lang.IllegalArgumentException: [PatchSet 324972,2] not found in [ChangeData{Change{324972 (I4a9ed6b0a58f2efe7679c3611a0499d536e20ce5), dest=operations/puppet,refs/heads/production, status=n}}, Change [21:26:22] Data{Change{325501 (I389e1b8b5e89244f0374d42bab48c56bdeb559ad), dest=operations/puppet,refs/heads/production, status=n}}, ChangeData{Change{325468 (I52f01d2ff70668b4d953f21fd4bc8c8998d85e76), dest=operati [21:26:22] ons/puppet,refs/heads/production, status=M}}, ChangeData{Change{325813 (I5a42571e40bb254566a458aa16117a819a1e2c0a), dest=operations/puppet,refs/heads/production, status=M}}] [21:26:31] (03PS3) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:26:34] SMalyshev: Mash F5 harder.... [21:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:38] Cuz wfm [21:26:47] I gotta gripe: the ui window is even wider now than it was. not quite enough room on my laptop screen >_< [21:27:08] apergos: Sooo, I was watching a talk that upstream gave ~3y ago at EclipseCon. [21:27:11] ostriches: dashboard works... weird. [21:27:21] yes? [21:27:30] Said presenter actually said: 'Gerrit is wide. 
It was made for nice wide 15in monitors' [21:27:33] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@adc5f07]: Update mobileapps to 2b1d206 (duration: 01m 12s) [21:27:45] *&^%$# [21:27:47] Yeah [21:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:54] (03PS2) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) [21:27:54] someone give that person a netbook. [21:27:57] a nice tiny netbook [21:28:17] He was complaining because he was presenting from a macbook air with like an 11in screen [21:28:22] * ostriches sighs [21:28:32] gerrits gone down [21:28:52] * apergos grits teeth [21:28:57] ostriches: hmm I still get 503 for that specific URL even though others work :( [21:29:11] confirming 503 now [21:29:11] SMalyshev: I dunno what to tell you. [21:29:13] It works. [21:29:28] f5 works [21:29:31] 503 just means it can't find the backend. [21:29:35] Which is certainly there. [21:30:17] SMalyshev: Stuck cache on your (or your ISP) end is all I can assume.... [21:30:25] intermittent, i could see it for a moment, ok after reload [21:30:34] ostriches: ok, will try to clean caches... thanks [21:31:09] i have no idea why the bot is crashing. [21:31:16] the logs are saying nothing [21:31:37] (03PS1) 10Chad: Gerrit: Remove useless space from config [puppet] - 10https://gerrit.wikimedia.org/r/325834 [21:31:49] (03CR) 10Chad: [C: 04-1] "Not now, I don't wanna restart again" [puppet] - 10https://gerrit.wikimedia.org/r/325834 (owner: 10Chad) [21:33:39] (03PS3) 10Merlijn van Deen: puppet_compiler: include puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/325053 (owner: 10Gerrit Patch Uploader) [21:33:53] ostriches: btw, in completely unrelated matters, what is the correct way to say ^demon? " Carrot Demon, the dark lord of vegetables"? Power Demon? 
[21:34:13] (03PS3) 10Merlijn van Deen: Puppet: refactor puppet-enc include [puppet] - 10https://gerrit.wikimedia.org/r/325046 [21:34:24] p858snake|L2_: "insert demon" is the way I say it [21:34:28] (03PS4) 10Merlijn van Deen: puppet_compiler: include puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/325053 (owner: 10Gerrit Patch Uploader) [21:34:40] I've been known to use INSERTdemon as a nick before [21:34:50] (03CR) 10Andrew Bogott: [C: 032] Nova policy: Make more read-only activities globally available [puppet] - 10https://gerrit.wikimedia.org/r/325806 (owner: 10Andrew Bogott) [21:34:55] (03PS4) 10Andrew Bogott: Nova policy: Make more read-only activities globally available [puppet] - 10https://gerrit.wikimedia.org/r/325806 [21:35:52] (03PS3) 10Dzahn: mediawiki: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325476 (owner: 10Tim Landscheidt) [21:36:38] !log updated Parsoid to version 3cf19c6b (T110910, T102209, T94949, T150112, T151570, T149209, T150213) [21:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:00] T102209: Derive heading ids from heading name, the same way MW core does - https://phabricator.wikimedia.org/T102209 [21:37:00] T110910: Implement extension natively inside Parsoid - https://phabricator.wikimedia.org/T110910 [21:37:01] T150213: Unknown contentmodels - https://phabricator.wikimedia.org/T150213 [21:37:01] T149209: Parsoid serialised an edit to a wikitext table adding a /n without stripping the double-pipes, breaking the table format (`\n|| align="right" | …`) - https://phabricator.wikimedia.org/T149209 [21:37:01] T151570: Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570 [21:37:01] T94949: Interwiki links to other MediaWiki wikis in the same cluster don't encode section fragment - https://phabricator.wikimedia.org/T94949 [21:37:01] T150112: Internal links pointing to interwikis are not encoded at all - https://phabricator.wikimedia.org/T150112 [21:37:18] 
/ignore stashbot [21:37:18] lol [21:38:01] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:38:40] 06Operations, 06Labs: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#2855573 (10chasemp) [21:38:51] (03PS1) 10Ottomata: Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) [21:38:56] (03PS1) 10Ottomata: Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) [21:38:58] (03PS4) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:39:54] (03CR) 10Dzahn: [C: 032] mediawiki: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325476 (owner: 10Tim Landscheidt) [21:39:59] (03PS4) 10Dzahn: mediawiki: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325476 (owner: 10Tim Landscheidt) [21:40:31] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:41:44] (03CR) 10jenkins-bot: [V: 04-1] Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [21:42:19] (03CR) 10Andrew Bogott: [C: 031] Add labs root key for bd808 [labs/private] - 10https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: 10BryanDavis) [21:42:54] (03CR) 10Yuvipanda: [C: 031] Add labs root key for bd808 [labs/private] - 10https://gerrit.wikimedia.org/r/325824 (https://phabricator.wikimedia.org/T152520) (owner: 10BryanDavis) [21:43:54] (03PS2) 
10Ottomata: Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) [21:43:58] (03CR) 10jenkins-bot: [V: 04-1] Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [21:44:12] (03PS1) 10ArielGlenn: move table job info to a default config file and add setting for override [dumps] - 10https://gerrit.wikimedia.org/r/325844 [21:49:26] (03CR) 10Ottomata: [C: 032] Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [21:49:32] (03PS3) 10Ottomata: Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) [21:49:47] (03CR) 10Paladox: "
  • • operations-puppet-tox-jessie https://integration.wikimedia.org/ci/job/operations-puppet-tox-jessie/10051/console : SUCCESS in 28s (03CR) 10Ottomata: [C: 032] "This is a no-op:" [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [21:49:55] (03CR) 10Ottomata: [V: 032 C: 032] Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325837 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [21:49:57] woops ^^ sorry [21:49:57] wrong place [21:51:02] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:53:08] (03PS1) 10Ottomata: Set kafka_producer_scheme to kafka-confluent for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325845 (https://phabricator.wikimedia.org/T142430) [21:53:55] (03PS2) 10Ottomata: Set kafka_producer_scheme to kafka-confluent for analytics eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/325845 (https://phabricator.wikimedia.org/T142430) [21:54:05] 06Operations, 06Labs: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#2855625 (10chasemp) [21:56:46] (03PS5) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:56:51] (03CR) 10Ottomata: [C: 032] "Looks good! 
https://puppet-compiler.wmflabs.org/4829/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/325845 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [21:56:54] (03PS6) 10Paladox: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 [21:59:45] (03PS3) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) [22:01:11] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [22:05:42] (03PS1) 10Ottomata: Add python-confluent-kafka to eventlogging::dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325846 (https://phabricator.wikimedia.org/T142430) [22:06:01] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:06:12] (03CR) 10Ottomata: [V: 032 C: 032] Add python-confluent-kafka to eventlogging::dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325846 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [22:06:47] (03PS2) 10Ottomata: Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) [22:07:47] (03CR) 10jenkins-bot: [V: 04-1] Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [22:12:04] (03PS3) 10Ottomata: Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) [22:13:15] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2855681 
(10Adrian_Bielefeldt) @leila I have read and signed the L3 document as well. [22:13:37] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#2855682 (10Jgreen) Fundraising uses mostly stock ganglia stuff, but there are a couple of simple collectors I've written or imported from production. It shouldn't be very difficult to refactor t... [22:14:32] (03CR) 10jenkins-bot: [V: 04-1] Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [22:15:36] (03PS4) 10Ottomata: Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) [22:18:14] (03CR) 10Paladox: [C: 031] "Tested this on http://gerrit-test.wmflabs.org/gerrit/#/c/17/ and it works." [puppet] - 10https://gerrit.wikimedia.org/r/325826 (owner: 10Paladox) [22:21:38] (03CR) 10Hashar: [C: 031] "Yup the regex is good :]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325826 (owner: 10Paladox) [22:22:04] 06Operations, 06Performance-Team, 10Traffic: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2855723 (10Gilles) ``` gilles@deployment-cache-text04:~$ sudo varnishlog -I BerespHeader:Backend-Timing -g raw ``` [22:23:11] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-confluent-kafka] [22:30:06] (03PS5) 10Dzahn: mediawiki: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325476 (owner: 10Tim Landscheidt) [22:31:31] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:32:31] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:33:35] (03PS2) 10Dzahn: Move ve files to role module [puppet] - 10https://gerrit.wikimedia.org/r/325701 (owner: 10Tim Landscheidt) [22:38:39] (03CR) 10Dzahn: [C: 032] Move ve files to role module [puppet] - 10https://gerrit.wikimedia.org/r/325701 (owner: 10Tim Landscheidt) [22:39:36] (03CR) 10Dzahn: [C: 032] mediawiki_singlenode: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325495 (owner: 10Tim Landscheidt) [22:40:29] (03PS3) 10Dzahn: mediawiki_singlenode: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325495 (owner: 10Tim Landscheidt) [22:40:31] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:31] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:42:46] (03PS4) 10Dzahn: scap: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325501 (owner: 10Tim Landscheidt) [22:43:53] 06Operations, 06Performance-Team, 10Traffic: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2182123 (10fgiunchedi) @gilles there's already a number of statsd/graphite python scripts in puppet that read varnish's shared memory. Could be of inspiration e.g. `varnishxcache` [22:43:59] (03CR) 10Dzahn: [C: 032] scap: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325501 (owner: 10Tim Landscheidt) [22:44:57] (03PS3) 10Dzahn: labstore: Fix puppet URL in comment [puppet] - 10https://gerrit.wikimedia.org/r/325493 (owner: 10Tim Landscheidt) [22:45:35] 06Operations, 06Performance-Team, 10Traffic: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2855765 (10Gilles) Right, I wrote/refactored some :) Do you think we should add that feature to an existing one, or write a new one? Since this will look at almost all requests. 
[22:46:51] Operations, Labs: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#2855573 (fgiunchedi) > The dataset is large. The current total for commons media[5] says Total file size for all files: 110,055,697,761,923 ? bytes (100.1 TB). -- I'm assuming that is unique dat...
[22:47:24] (CR) Dzahn: [C: 2] labstore: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325493 (owner: Tim Landscheidt)
[22:51:57] (CR) Dzahn: [C: 2] "confirmed, grep through the repo shows that all logrotate-related file sources are inside modules" [puppet] - https://gerrit.wikimedia.org/r/325577 (owner: Tim Landscheidt)
[22:52:02] (PS2) Dzahn: Remove obsolete logrotate files [puppet] - https://gerrit.wikimedia.org/r/325577 (owner: Tim Landscheidt)
[22:54:30] (PS4) Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092)
[22:55:51] (PS2) Dzahn: toolserver_legacy: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325502 (owner: Tim Landscheidt)
[22:59:15] (CR) Dzahn: [C: 2] toolserver_legacy: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325502 (owner: Tim Landscheidt)
[22:59:57] (PS3) Dzahn: beta: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325469 (owner: Tim Landscheidt)
[23:00:11] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[23:03:58] (CR) Dzahn: [C: 2] beta: Fix puppet URL in comment [puppet] - https://gerrit.wikimedia.org/r/325469 (owner: Tim Landscheidt)
[23:05:07] Request from 104.238.169.17 via cp3008 cp3008, Varnish XID 17076448
[23:05:08] Error: 503, Backend fetch failed at Wed, 07 Dec 2016 23:04:48 GMT
[23:06:11] fixed
[23:06:25] (PS3) Dzahn: contint: Fix puppet URLs in comments [puppet] - https://gerrit.wikimedia.org/r/325470 (owner: Tim Landscheidt)
[23:07:36] (CR) Dzahn: [C: 2] contint: Fix puppet URLs in comments [puppet] - https://gerrit.wikimedia.org/r/325470 (owner: Tim Landscheidt)
[23:12:13] (CR) Dzahn: openstack: Fix puppet URLs in comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/325479 (owner: Tim Landscheidt)
[23:14:20] !log update RESTBase to e2b319a1 - staging
[23:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:58] (PS2) Dzahn: install: add 'preseed'-role to install1001 [puppet] - https://gerrit.wikimedia.org/r/325737 (https://phabricator.wikimedia.org/T132757)
[23:21:26] Operations, Performance-Team, Traffic: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#2855871 (fgiunchedi) >>! In T131894#2855765, @Gilles wrote: > Right, I wrote/refactored some :) Do you think we should add that feature to an existing one, or write a new one? Since t...
[23:22:43] (CR) Hashar: "We are using OpenStack diskimage-builder ( http://docs.openstack.org/developer/diskimage-builder/ ). It has several phases when building " [puppet] - https://gerrit.wikimedia.org/r/325570 (owner: Hashar)
[23:24:35] (PS11) BBlack: VCL refactor: split cache/app backend support [puppet] - https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717)
[23:24:37] (PS2) BBlack: Varnish: remove "varnish-be-rand" conftool service [puppet] - https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717)
[23:24:39] (PS14) BBlack: cache_misc app_directors/req_handling split [puppet] - https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717)
[23:24:41] (PS14) BBlack: cache_misc req_handling: sort entries [puppet] - https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717)
[23:24:43] (PS12) BBlack: cache_misc req_handling: add force-pass support [puppet] - https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717)
[23:24:45] (PS12) BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717)
[23:25:26] apparently 6x changes at once is the limit, 7 breaks it? :P
[23:27:57] It should be 10 is the max batch
[23:28:29] I suspect lolrrit, not gerrit itself
[23:28:51] What the actual f....
[23:29:04] Did gerrit really just restart itself again?
[23:29:13] bblack I have the logs
[23:29:25] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2855891 (RobH)
[23:29:26] doesn't look like anything to do with it doing 6x changes
[23:29:30] Operations, ops-codfw, Discovery, Wikidata, Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2855908 (RobH)
[23:29:34] Well of course that wouldn't.
[23:29:40] * ostriches sighs
[23:29:42] maybe something about my changes? I could re-word the oldest one to re-hash them all and upload again to see if it crashes again :)
[23:29:44] bblack https://phabricator.wikimedia.org/P4588
[23:29:45] I freaking hate....
[23:30:21] I've reported it here https://phabricator.wikimedia.org/P4588
[23:30:23] woops
[23:30:27] https://github.com/martynsmith/node-irc/issues/485
[23:32:41] bblack ostriches I'm wondering, should I try this fix https://github.com/martynsmith/node-irc/issues/485#issuecomment-262045206
[23:32:59] irc bot is least of my worries.
[23:33:01] that hasn't really been tested so not sure of any other breakage but it may stop it from crashing?
[23:34:24] !log update RESTBase to e2b319a1 - canary on restbase1007
[23:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:38] ostriches: I'm still periodically getting 503 on https://gerrit.wikimedia.org/r/#/c/325821/ :( Doesn't happen every time but sometimes... something is not right there I think
[23:34:52] and on Phabricator as well
[23:35:07] That sounds like cold cache
[23:35:20] SMalyshev: Yeah, I saw it a bit ago too. I'm not a fan of this new release.
[23:35:48] confirmed
[23:35:51] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:39:54] !log update RESTBase to e2b319a1
[23:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:49] (PS2) Chad: Gerrit: Remove useless space from config [puppet] - https://gerrit.wikimedia.org/r/325834
[23:46:35] (CR) Dzahn: [V: 0 C: 2] install: add 'preseed'-role to install1001 [puppet] - https://gerrit.wikimedia.org/r/325737 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[23:55:06] (PS3) Chad: Gerrit: Remove useless space from config [puppet] - https://gerrit.wikimedia.org/r/325834