[00:04:30] PROBLEM - Nginx local proxy to apache on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:05:20] RECOVERY - Nginx local proxy to apache on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.190 second response time [00:56:21] PROBLEM - Apache HTTP on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:57:20] RECOVERY - Apache HTTP on mw2146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time [01:12:50] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:13:40] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 73401 bytes in 0.283 second response time [02:26:16] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.2) (duration: 08m 49s) [02:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 10 02:32:57 UTC 2017 (duration 6m 41s) [02:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.24 seconds [03:30:41] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:33:30] PROBLEM - Nginx local proxy to apache on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:34:20] RECOVERY - Nginx local proxy to apache on mw2201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.199 second response time [04:00:40] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [04:11:20] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 206.65 seconds [05:18:44] (03CR) 10Zoranzoki21: [C: 031] Disable OCG services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [05:27:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383305 (https://phabricator.wikimedia.org/T174509) [05:29:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383305 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:31:31] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383305 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:31:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383305 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:32:13] !log Optimize templatelinks and pagelinks on db1080 - T174509 [05:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:21] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [05:32:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080 - T174509 (duration: 00m 48s) [05:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:27] (03PS1) 
10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383306 (https://phabricator.wikimedia.org/T174509) [05:36:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383306 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:37:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383306 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:37:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383306 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:38:58] !log Optimize templatelinks and pagelinks on db1076 - T174509 [05:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:04] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [05:39:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 - T174509 (duration: 00m 47s) [05:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383307 (https://phabricator.wikimedia.org/T174509) [05:43:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383307 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:45:07] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383307 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:45:44] !log Optimize templatelinks and pagelinks on db1071 - T174509 [05:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:51] T174509: Drop now redundant indexes from pagelinks and templatelinks - 
https://phabricator.wikimedia.org/T174509 [05:46:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383307 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:46:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1071 - T174509 (duration: 00m 48s) [05:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:37] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383308 (https://phabricator.wikimedia.org/T174509) [05:51:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383308 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:53:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383308 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:53:19] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383308 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:53:33] !log Optimize templatelinks and pagelinks on db1079 - T174509 [05:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:40] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [05:54:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T174509 (duration: 00m 47s) [05:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:58] (03PS6) 10Marostegui: db-eqiad.php: Set commonswiki on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382379 (https://phabricator.wikimedia.org/T176883) [06:04:09] (03PS4) 10Marostegui: db1068: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/382380 (https://phabricator.wikimedia.org/T168661) [06:05:55] 
!log Stop MySQL on db1072 to move it to s3 - T172679 [06:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:03] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [06:07:08] (03PS3) 10Marostegui: mariadb: Move db1072 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/383085 (https://phabricator.wikimedia.org/T172679) [06:09:09] 10Operations, 10Pybal, 10Traffic: Alerts on LVS services with one single realserver - https://phabricator.wikimedia.org/T177815#3671128 (10ema) [06:09:18] 10Operations, 10Pybal, 10Traffic: Alerts on LVS services with one single realserver - https://phabricator.wikimedia.org/T177815#3671140 (10ema) p:05Triage>03Normal [06:10:49] (03CR) 10Marostegui: [C: 032] mariadb: Move db1072 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/383085 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:14:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383309 (https://phabricator.wikimedia.org/T172679) [06:17:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383309 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:18:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383309 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:18:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383309 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:19:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1038 - T172679 (duration: 00m 47s) [06:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:04] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [06:22:07] !log Stop MySQL on 
db1038 to transfer its data to db1072 - T172679 [06:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:02] (03CR) 10Ema: [C: 031] "LGTM and to pcc https://puppet-compiler.wmflabs.org/compiler02/8250/" [puppet] - 10https://gerrit.wikimedia.org/r/383073 (owner: 10Giuseppe Lavagetto) [06:27:37] !log Drop moodbar_feedback and moodbar_feedback_response from s1 - T153033 [06:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:43] T153033: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033 [06:27:51] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [06:28:51] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.010 second response time [06:30:34] (03PS1) 10Marostegui: s3.hosts,s1.hosts: Move db1072 from s1 to s3 [software] - 10https://gerrit.wikimedia.org/r/383310 (https://phabricator.wikimedia.org/T172679) [06:31:51] (03CR) 10Marostegui: [C: 032] s3.hosts,s1.hosts: Move db1072 from s1 to s3 [software] - 10https://gerrit.wikimedia.org/r/383310 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:32:41] (03Merged) 10jenkins-bot: s3.hosts,s1.hosts: Move db1072 from s1 to s3 [software] - 10https://gerrit.wikimedia.org/r/383310 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:35:25] !log Drop moodbar_feedback and moodbar_feedback_response from s3 - T153033 [06:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:32] T153033: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033 [06:38:54] (03PS2) 10Ema: prometheus: add nginx_cache_upload cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/382664 [06:39:36] (03CR) 10Ema: [C: 032] prometheus: add nginx_cache_upload cluster definition [puppet] - 
10https://gerrit.wikimedia.org/r/382664 (owner: 10Ema) [06:43:31] (03PS4) 10Giuseppe Lavagetto: profile::cache::ssl::unified: move from role, refactor [puppet] - 10https://gerrit.wikimedia.org/r/383073 [06:45:21] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::cache::ssl::unified: move from role, refactor [puppet] - 10https://gerrit.wikimedia.org/r/383073 (owner: 10Giuseppe Lavagetto) [06:58:07] 10Operations, 10Pybal, 10Traffic: Alerts on LVS services with one single realserver - https://phabricator.wikimedia.org/T177815#3671160 (10Joe) I would suggest we need to add a condition to the alert so that it gets skipped when the pool size is one backend only. [06:58:10] (03PS1) 10Marostegui: mariadb: Provision db2080 on s5 [puppet] - 10https://gerrit.wikimedia.org/r/383311 (https://phabricator.wikimedia.org/T170662) [06:59:10] (03PS1) 10Marostegui: s5.hosts: Add db2080 to s5 [software] - 10https://gerrit.wikimedia.org/r/383312 (https://phabricator.wikimedia.org/T170662) [07:00:11] (03PS1) 10Ema: role::cache::upload: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383313 [07:00:26] (03CR) 10jerkins-bot: [V: 04-1] role::cache::upload: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383313 (owner: 10Ema) [07:00:41] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db2080 to s5 [software] - 10https://gerrit.wikimedia.org/r/383312 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [07:01:29] (03Merged) 10jenkins-bot: s5.hosts: Add db2080 to s5 [software] - 10https://gerrit.wikimedia.org/r/383312 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [07:01:47] (03CR) 10Marostegui: [C: 032] "This looks good: https://puppet-compiler.wmflabs.org/compiler02/8251/" [puppet] - 10https://gerrit.wikimedia.org/r/383311 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [07:04:56] !log Stop MySQL on db2079 to clone db2080 - https://phabricator.wikimedia.org/T170662 [07:05:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:07] (03PS2) 10Ema: role::cache::upload: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383313 [07:05:17] (03CR) 10jerkins-bot: [V: 04-1] role::cache::upload: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383313 (owner: 10Ema) [07:07:32] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3671174 (10akosiaris) >>! In T177225#3669602, @MoritzMuehlenhoff wrote: > For dropping the salt minion, the removal was done in two stages, first a commit which purged the packages and... [07:09:05] 10Operations, 10Puppet, 10Operations-Software-Development: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133#3671175 (10jcrespo) Not lately. [07:10:03] 10Operations, 10Puppet, 10Operations-Software-Development: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133#3671176 (10jcrespo) Is the request so unreasonable to stay with "very low"? [07:13:40] (03PS3) 10Muehlenhoff: Jenkins now supports our MAC/KEX algorithms [prod] [puppet] - 10https://gerrit.wikimedia.org/r/383122 (https://phabricator.wikimedia.org/T103351) (owner: 10Hashar) [07:16:16] PROBLEM - Disk space on flerovium is CRITICAL: DISK CRITICAL - free space: /mnt/2a 1493608 MB (3% inode=96%) [07:17:38] 10Operations, 10Analytics-Kanban, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671178 (10akosiaris) We 've had part of this discussion in #wikimedia-netops IRC channel. I can post a backlog (we don't have a bot yet archiving that channel) but some first (and partial) consensus se... 
[07:32:29] (03PS2) 10Giuseppe Lavagetto: role::cache::text: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383078 [07:35:26] (03PS1) 10Elukey: Remove new appserver rule from the hiera regex yaml [puppet] - 10https://gerrit.wikimedia.org/r/383315 (https://phabricator.wikimedia.org/T165519) [07:35:55] (03CR) 10Elukey: [C: 032] Remove new appserver rule from the hiera regex yaml [puppet] - 10https://gerrit.wikimedia.org/r/383315 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [07:42:47] (03PS1) 10Muehlenhoff: Provide a reboot wrapper for Cumin clients [puppet] - 10https://gerrit.wikimedia.org/r/383316 [07:50:11] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3671190 (10Joe) >>! In T177276#3666812, @Legoktm wrote: > Some requirements of this build process: > * Basic macro support: > ** `{{ apt... [07:54:04] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/8253/" [puppet] - 10https://gerrit.wikimedia.org/r/383078 (owner: 10Giuseppe Lavagetto) [07:56:20] (03CR) 10Gehel: Provide a reboot wrapper for Cumin clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383316 (owner: 10Muehlenhoff) [07:57:46] (03PS3) 10Ema: role::cache::text: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383078 (owner: 10Giuseppe Lavagetto) [07:58:12] (03CR) 10Giuseppe Lavagetto: [C: 031] role::cache::upload: move to role/profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383313 (owner: 10Ema) [07:58:43] (03CR) 10Ema: [C: 031] role::cache::text: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383078 (owner: 10Giuseppe Lavagetto) [07:59:19] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db2080 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383317 (https://phabricator.wikimedia.org/T170662) [08:00:04] Amir1: Your horoscope predicts 
another unfortunate ores_classification table cleanup deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T0800). [08:00:04] No GERRIT patches in the queue for this window AFAICS. [08:00:54] (03CR) 10Giuseppe Lavagetto: [C: 032] role::cache::text: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383078 (owner: 10Giuseppe Lavagetto) [08:01:36] 10Operations, 10Analytics-Kanban, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671196 (10elukey) >>! In T177511#3671178, @akosiaris wrote: > We 've had part of this discussion in #wikimedia-netops IRC channel. I can post a backlog (we don't have a bot yet archiving that channel)... [08:02:41] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db2080 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383317 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:05:04] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2080 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383317 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:06:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2080 to the config - T170662 (duration: 00m 47s) [08:06:12] (03PS3) 10Ema: role::cache::upload: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383313 [08:06:14] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2080 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383317 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:20] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [08:07:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2080 to the config - T170662 (duration: 00m 46s) [08:07:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:38] (03CR) 10Ema: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler03/8255/" [puppet] - 10https://gerrit.wikimedia.org/r/383313 (owner: 10Ema) [08:14:46] (03PS3) 10Gehel: [test] mediawiki: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/383147 (https://phabricator.wikimedia.org/T175242) [08:15:19] (03CR) 10Gehel: [C: 032] [test] mediawiki: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/383147 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [08:16:39] (03PS7) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [08:17:10] (03CR) 10jerkins-bot: [V: 04-1] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [08:20:30] 10Operations, 10Analytics-Kanban, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671213 (10elukey) @akosiaris what are the steps to take to move druid100[345] outside the analytics vlan(s)? From what I know we'd need to: 1) Properly remove the hosts from service (already done by A... [08:20:46] (03CR) 10Qgil: [C: 04-1] "Considering that interwiki support will not be implemented any time soon, I think it would be better to abandon this patch. Every time I s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: 10Zoranzoki21) [08:24:32] 10Operations, 10Analytics-Kanban, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671215 (10akosiaris) >>! In T177511#3671213, @elukey wrote: > @akosiaris what are the steps to take to move druid100[345] outside the analytics vlan(s)? From what I know we'd need to: > > 1) Properly... [08:29:31] (03CR) 10Gehel: "https://gerrit.wikimedia.org/r/#/c/383147/ has been merged as a test on a few nodes. 
Logs are still flowing, all looks good for those node" [puppet] - 10https://gerrit.wikimedia.org/r/383146 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [08:30:21] 10Operations: Integrate stretch 9.2 point release - https://phabricator.wikimedia.org/T177739#3671220 (10MoritzMuehlenhoff) [08:33:48] (03PS8) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [08:34:20] (03CR) 10jerkins-bot: [V: 04-1] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [08:46:43] (03PS1) 10Elukey: Reassign IPs to druid100[456] to move them out of the Analytics VLAN [dns] - 10https://gerrit.wikimedia.org/r/383318 (https://phabricator.wikimedia.org/T177511) [08:48:51] (03PS9) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [08:49:22] (03CR) 10jerkins-bot: [V: 04-1] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [08:49:24] (03CR) 10Elukey: "I moved IPs from the analytics vlan row X to the first free spot in the correspondent private1-X subnet, not sure if this is enough or not" [dns] - 10https://gerrit.wikimedia.org/r/383318 (https://phabricator.wikimedia.org/T177511) (owner: 10Elukey) [08:49:52] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671257 (10akosiaris) > 5. Remove druid100[456] from any router ACL entry (Garbage collection). > > I 'll do 5, the rest all LGTM Done (I had nothing to do actually, there are no... [08:51:39] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671259 (10elukey) >>! In T177511#3671257, @akosiaris wrote: >> 5. 
Remove druid100[456] from any router ACL entry (Garbage collection). >> >> I 'll do 5, the rest all LGTM > > Do... [08:52:05] (03PS1) 10Giuseppe Lavagetto: role::cache::misc: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383319 [08:52:07] (03PS1) 10Giuseppe Lavagetto: role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 [08:54:49] (03CR) 10Ema: [C: 04-1] role::cache::misc: move to role/profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383319 (owner: 10Giuseppe Lavagetto) [08:59:07] (03CR) 10Ema: "noop everywhere except on pinkunicorn https://puppet-compiler.wmflabs.org/compiler02/8258/" [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [09:02:29] (03CR) 10Alexandros Kosiaris: [C: 031] Reassign IPs to druid100[456] to move them out of the Analytics VLAN [dns] - 10https://gerrit.wikimedia.org/r/383318 (https://phabricator.wikimedia.org/T177511) (owner: 10Elukey) [09:09:40] (03PS2) 10Giuseppe Lavagetto: role::cache::misc: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383319 [09:09:42] (03PS2) 10Giuseppe Lavagetto: role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 [09:09:44] (03PS1) 10Giuseppe Lavagetto: profile::cache::base: explicitly declare varnish classes [puppet] - 10https://gerrit.wikimedia.org/r/383322 [09:10:53] (03CR) 10Gehel: [C: 032] Upgrade plugins (official LTR RC1, extra 5.5.2.3, highlighter 5.5.2.2) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/381798 (owner: 10DCausse) [09:10:56] (03CR) 10Gehel: [V: 032 C: 032] Upgrade plugins (official LTR RC1, extra 5.5.2.3, highlighter 5.5.2.2) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/381798 (owner: 10DCausse) [09:13:26] (03PS2) 10Giuseppe Lavagetto: profile::cache::base: explicitly declare varnish classes [puppet] - 10https://gerrit.wikimedia.org/r/383322 
[09:19:42] (03CR) 10Thiemo Mättig (WMDE): [C: 031] PHP CodeSniffer no more process autogenerated files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383234 (owner: 10Hashar) [09:19:46] (03PS5) 10Filippo Giunchedi: smart: new module [puppet] - 10https://gerrit.wikimedia.org/r/378039 (https://phabricator.wikimedia.org/T86552) [09:20:52] (03CR) 10Filippo Giunchedi: [C: 032] smart: new module [puppet] - 10https://gerrit.wikimedia.org/r/378039 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [09:22:25] (03CR) 10Volans: "What about a simple 'reboot-host' ?" [puppet] - 10https://gerrit.wikimedia.org/r/383316 (owner: 10Muehlenhoff) [09:22:29] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "This is ready to be live on production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383163 (https://phabricator.wikimedia.org/T176863) (owner: 10Lucas Werkmeister (WMDE)) [09:25:13] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/382415 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [09:26:54] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Do not make dumps of wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352797 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [09:27:38] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3671327 (10fgiunchedi) [09:28:08] (03PS3) 10Giuseppe Lavagetto: role::cache::misc: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383319 [09:28:10] (03PS3) 10Giuseppe Lavagetto: role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 [09:28:12] (03PS3) 10Giuseppe Lavagetto: profile::cache::base: explicitly declare varnish classes [puppet] - 10https://gerrit.wikimedia.org/r/383322 [09:28:27] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable syslog over tls for eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/382415 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [09:28:34] (03PS2) 10Filippo Giunchedi: hieradata: enable syslog over tls for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/382415 (https://phabricator.wikimedia.org/T136312) [09:28:44] (03CR) 10Thiemo Mättig (WMDE): [C: 031] labs: do not replicate wb_entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/382694 (https://phabricator.wikimedia.org/T95685) (owner: 10Ladsgroup) [09:28:57] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Do not rebuild wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [09:32:11] (03CR) 10Muehlenhoff: Provide a reboot wrapper for Cumin clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383316 (owner: 10Muehlenhoff) [09:34:24] (03CR) 10Ema: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8261/" [puppet] - 10https://gerrit.wikimedia.org/r/383319 (owner: 10Giuseppe Lavagetto) [09:39:56] 10Operations, 10netops: Allow syslog (-tls) from both wezen and lithium in labs - https://phabricator.wikimedia.org/T177820#3671443 (10fgiunchedi) [09:57:17] (03PS4) 10Giuseppe Lavagetto: role::cache::misc: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383319 [09:58:46] (03PS1) 10Marostegui: db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 [09:59:00] (03CR) 10Giuseppe Lavagetto: "I can see the practical uses of this, but I was kinda glad we could not reboot the fleet with one cumin command. 
Are we sure that's not a " [puppet] - 10https://gerrit.wikimedia.org/r/383316 (owner: 10Muehlenhoff) [09:59:19] (03CR) 10Giuseppe Lavagetto: [C: 032] role::cache::misc: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/383319 (owner: 10Giuseppe Lavagetto) [10:05:02] (03CR) 10Muehlenhoff: "That's definitely a desirable feature and frequently used for cluster-wide reboots. It actually fixes a regression in the Cumin feature se" [puppet] - 10https://gerrit.wikimedia.org/r/383316 (owner: 10Muehlenhoff) [10:09:01] (03CR) 10Ema: [C: 031] role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 (owner: 10Giuseppe Lavagetto) [10:14:45] (03CR) 10Ema: [C: 031] profile::cache::base: explicitly declare varnish classes [puppet] - 10https://gerrit.wikimedia.org/r/383322 (owner: 10Giuseppe Lavagetto) [10:15:30] 10Operations, 10Analytics-Kanban, 10netops, 10Patch-For-Review, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671561 (10elukey) [10:23:31] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/8264/" [puppet] - 10https://gerrit.wikimedia.org/r/383320 (owner: 10Giuseppe Lavagetto) [10:23:43] (03CR) 10Giuseppe Lavagetto: [C: 032] role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 (owner: 10Giuseppe Lavagetto) [10:24:02] (03PS4) 10Giuseppe Lavagetto: role::cache::instances: move to cacheproxy module [puppet] - 10https://gerrit.wikimedia.org/r/383320 [10:26:10] (03PS4) 10Giuseppe Lavagetto: profile::cache::base: explicitly declare varnish classes [puppet] - 10https://gerrit.wikimedia.org/r/383322 [10:26:22] !log start of cleaning up ores_classification in wikidatawiki (T159753) [10:26:29] sorry I was very very late [10:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:30] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 
[10:28:07] (03CR) 10Daniel Kinzler: "Has this been announced again to the labs mailing list?" [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [10:28:44] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8265/" [puppet] - 10https://gerrit.wikimedia.org/r/383322 (owner: 10Giuseppe Lavagetto) [10:33:02] !log T177511 switch druid100[456] to private1-x-eqiad VLANs [10:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:09] T177511: LVS for Druid - https://phabricator.wikimedia.org/T177511 [10:42:07] (03CR) 10Elukey: [C: 032] Reassign IPs to druid100[456] to move them out of the Analytics VLAN [dns] - 10https://gerrit.wikimedia.org/r/383318 (https://phabricator.wikimedia.org/T177511) (owner: 10Elukey) [10:44:25] (03PS1) 10Muehlenhoff: Remove libapt leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/383331 [10:45:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Jenkins: Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#3671597 (10Paladox) [10:45:34] (03PS1) 10Mforns: Add cron job for analytics banner activity cleaner [puppet] - 10https://gerrit.wikimedia.org/r/383332 (https://phabricator.wikimedia.org/T164497) [11:10:28] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, just a grammar nit in commit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383331 (owner: 10Muehlenhoff) [11:16:20] (03PS1) 10Ladsgroup: admin: new ssh key for Ladsgroup [puppet] - 10https://gerrit.wikimedia.org/r/383334 [11:18:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383335 [11:18:28] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383335 [11:18:48] 10Operations, 10Analytics-Kanban, 10netops, 10Patch-For-Review, 10User-Elukey: 
LVS for Druid - https://phabricator.wikimedia.org/T177511#3661358 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1004.eqiad.wmnet'] ``` The log can be found... [11:18:58] 10Operations, 10Analytics-Kanban, 10netops, 10Patch-For-Review, 10User-Elukey: LVS for Druid - https://phabricator.wikimedia.org/T177511#3671633 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1004.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['druid1004.eqiad.wmnet'] ``` [11:20:08] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383335 (owner: 10Marostegui) [11:20:47] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3671635 (10MoritzMuehlenhoff) >>! In T52864#3669128, @MarcoAurelio wrote: > @MoritzMuehlenhoff Does that help or ease the suggested migration? Thanks.... 
[11:21:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383335 (owner: 10Marostegui) [11:21:45] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383335 (owner: 10Marostegui) [11:22:01] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383336 [11:23:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1071 - T174509 (duration: 01m 21s) [11:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:13] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:23:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383336 [11:24:53] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383336 (owner: 10Marostegui) [11:26:23] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383336 (owner: 10Marostegui) [11:26:50] (03PS1) 10Volans: wmf-auto-reimage: fix console list initialization [puppet] - 10https://gerrit.wikimedia.org/r/383341 [11:27:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383336 (owner: 10Marostegui) [11:28:05] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix console list initialization [puppet] - 10https://gerrit.wikimedia.org/r/383341 (owner: 10Volans) [11:30:01] PROBLEM - SSH on silver is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:31:08] !log upgrading relforge to elastic 5.5.2 [11:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 - T174509 (duration: 05m 02s) 
[11:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:43] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:32:11] !log downgrading HHVM on deployment-mediawiki04 to HHVM 3.18.2 temporarily (to further narrow down a problem with the new wikidiff package) [11:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:02] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [11:38:38] 10Operations: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3671670 (10Aklapper) [11:49:49] hey, my ssh key can be merged soon, I lost access to my account. https://gerrit.wikimedia.org/r/#/c/383334/ [11:52:58] (03PS1) 10DCausse: [elasticsearch] Mark production plugins as mandatory [puppet] - 10https://gerrit.wikimedia.org/r/383345 [11:55:19] (03CR) 10DCausse: [C: 04-1] "it needs elastic 5.5.2" [puppet] - 10https://gerrit.wikimedia.org/r/383345 (owner: 10DCausse) [12:17:44] (03PS10) 10Amire80: Remove compact language links dblist for simplicity (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 [12:18:44] (03CR) 10Amire80: Remove compact language links dblist for simplicity (no-op) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80) [12:23:09] (03CR) 10Gehel: [elasticsearch] Mark production plugins as mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383345 (owner: 10DCausse) [12:24:23] (03CR) 10DCausse: [C: 04-1] [elasticsearch] Mark production plugins as mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383345 (owner: 10DCausse) [12:25:07] (03CR) 10Gehel: [elasticsearch] Mark production plugins as mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383345 (owner: 10DCausse) [12:27:23] (03PS1) 10Volans: wmf-auto-reimage: fix console list initialization (2) 
[puppet] - 10https://gerrit.wikimedia.org/r/383348 [12:28:07] (03CR) 10Elukey: [C: 031] wmf-auto-reimage: fix console list initialization (2) [puppet] - 10https://gerrit.wikimedia.org/r/383348 (owner: 10Volans) [12:28:15] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix console list initialization (2) [puppet] - 10https://gerrit.wikimedia.org/r/383348 (owner: 10Volans) [12:34:00] (03PS2) 10Muehlenhoff: Remove libapt leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/383331 [12:39:15] (03CR) 10Nikerabbit: Remove compact language links dblist for simplicity (no-op) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80) [12:41:29] (03PS3) 10Muehlenhoff: Remove libapt leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/383331 [12:43:21] (03PS2) 10Marostegui: db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 [12:47:22] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:31] (03PS2) 10Amire80: Configure wmgBabelMainCategory for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366233 [12:48:48] (03PS10) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [12:49:02] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:15] (03CR) 10jerkins-bot: [V: 04-1] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [12:49:25] (03CR) 10Zoranzoki21: "OK I will do it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: 10Zoranzoki21) [12:49:27] (03CR) 10Nikerabbit: Remove compact language links dblist for simplicity (no-op) (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80) [12:49:30] (03PS11) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) [12:50:29] (03PS11) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [12:51:28] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3671838 (10Addshore) @MoritzMuehlenhoff and I just spent some time trying to reproduce this. We rolled hhvm back as... [12:51:42] (03PS2) 10Aude: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) [12:53:52] (03Abandoned) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: 10Zoranzoki21) [12:54:29] (03PS3) 10Aude: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) [12:54:43] (03PS2) 10EddieGP: wikitech: Align 'contentadmin' and 'sysop' permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) [12:57:01] PROBLEM - Check for gridmaster host resolution TCP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:58:51] PROBLEM - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:58:51] suppose i could do swat today [12:59:54] (03PS4) 10Zoranzoki21: Enable SandboxLink on gawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383145 
(https://phabricator.wikimedia.org/T177775) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T1300). [13:00:05] Lucas_WMDE and aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:35] hmm labs dns is down. [13:00:54] [14:57] PROBLEM - Check for gridmaster host resolution TCP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:58] PROBLEM - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:00:59] chasemp andrewbogott ^^ [13:01:10] I can SWAT today [13:01:19] [14:58] suppose i could do swat today [13:01:33] Zoranzoki21, aude: ah, did not notice that :) [13:01:46] aude: good luck with swat :) [13:02:17] Jenkis is now small slow [13:02:28] (03CR) 10Aude: "current setting:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [13:03:15] zeljkof: ok [13:03:34] i was just double checking my config patch is correct [13:03:44] before i start [13:04:23] (03CR) 10Muehlenhoff: [C: 032] Remove libapt leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/383331 (owner: 10Muehlenhoff) [13:05:18] (03CR) 10Zoranzoki21: [C: 031] Configure wmgBabelMainCategory for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366233 (owner: 10Amire80) [13:05:34] (03CR) 10Aude: [C: 04-1] "difference is that there will be a dot in the cache key name" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 
(https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [13:06:10] Zoranzoki21: i'll deploy yours first [13:06:28] ok [13:06:40] 10Operations: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#3668533 (10BBlack) > We could run one of our NTP servers based on Chrony parallel to the existing ones to see whether it meets our needs (which are fairly limited in terms of NTP features) Sounds like a go... [13:06:46] You today have only one.. :D [13:06:56] :) [13:08:16] (03CR) 10Aude: [C: 032] Enable SandboxLink on gawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383145 (https://phabricator.wikimedia.org/T177775) (owner: 10Zoranzoki21) [13:09:19] 10Operations, 10Wikimedia-Logstash: rsyslog on mw1180 seems to not use the logstash LVS endpoint - https://phabricator.wikimedia.org/T177833#3671875 (10Gehel) [13:09:30] (03PS1) 10Elukey: wmf_auto_reimage_host: fix check in mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/383351 [13:09:49] (03Merged) 10jenkins-bot: Enable SandboxLink on gawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383145 (https://phabricator.wikimedia.org/T177775) (owner: 10Zoranzoki21) [13:09:59] (03CR) 10jenkins-bot: Enable SandboxLink on gawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383145 (https://phabricator.wikimedia.org/T177775) (owner: 10Zoranzoki21) [13:11:25] !log restarting rsyslog on mw1180 - T177833 [13:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:38] T177833: rsyslog on mw1180 seems to not use the logstash LVS endpoint - https://phabricator.wikimedia.org/T177833 [13:12:12] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:22] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:12:24] gehel: interesting, I wonder if it has anything to do with the syslog-tls change I deployed today [13:13:26] (03PS1) 10Filippo Giunchedi: hieradata: enable syslog over tls for esams [puppet] - 10https://gerrit.wikimedia.org/r/383352 (https://phabricator.wikimedia.org/T136312) [13:13:34] godog: did your change have anything to do with logstash? [13:13:44] (03PS1) 10BBlack: WMF-Last-Access-Global: not for wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/383353 (https://phabricator.wikimedia.org/T174640) [13:13:51] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:52] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:14:00] Zoranzoki21: your change is on mwdebug1002 [13:14:21] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:14:24] 10Operations, 10Cloud-Services: Wikimedia Cloud (labs) dns is intermittingly failing - https://phabricator.wikimedia.org/T177834#3671897 (10Paladox) [13:14:31] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:14:34] gehel: no, it changed rsyslog configuration to use tcp and syslog-tls towards syslog.{eqiad,codfw}.wmnet [13:14:34] akosiaris: seems puppetdb got restarted (again) [13:14:59] Active: active (running) since Tue 2017-10-10 13:09:57 UTC; 4min 56s ago [13:15:11] nice [13:15:17] * aude checks [13:15:25] (03PS2) 10Elukey: wmf_auto_reimage_host: fix check in mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/383351 [13:15:27] akosiaris: [ +0.000002] Out of memory: Kill process 12103 (java) score 356 or sacrifice child [13:15:31] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:31] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:33] * Dereckson will have a confg change to add later in the window [13:15:41] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:52] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:52] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:16:13] volans: Killed process 12103 (java) total-vm:12242204kB, anon-rss:6197636kB, file-rss:0kB, shmem-rss:0kB [13:16:22] so... 6GB RSS and 12 VM [13:16:51] it's suspiciously enough close to -Xmx6G [13:16:57] Dereckson: ok [13:17:01] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:02] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:17:05] sandbox link looks ok to me [13:17:10] at the same time it can't be that java issues an OOM [13:17:22] right [13:17:29] akosiaris: look at: sudo dmesg -T | grep "Out of memory: Kill process" [13:17:34] quite often [13:17:57] so, -XX:+ExitOnOutOfMemoryError works ? [13:18:14] oh wait... that's always the OOM [13:18:15] unrelated [13:19:10] (03PS12) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [13:19:10] anon-rss values vary wildly though [13:19:15] (03CR) 10Volans: [C: 031] "LGTM, thanks for the fix!" [puppet] - 10https://gerrit.wikimedia.org/r/383351 (owner: 10Elukey) [13:19:27] anon-rss:6706068kB, anon-rss:6651808kB, anon-rss:6661996kB, anon-rss:6279548kB, anon-rss:6197636kB, [13:19:36] at around the time of the crash there is one error in puppetdb.log [13:19:37] 2017-10-10 13:10:13,250 ERROR [o.a.a.b.BrokerService] Temporary Store limit is 50000 mb, whilst the temporary data directory: /var/lib/puppetdb/mq/localhost/tmp_storage only has 22682 mb of usable space [13:19:50] akosiaris: do we have GC lgos for puppetdb? I'm happy to have a look... [13:20:03] > sandbox link looks ok to me I too translated name.. It will be deployed, when L10n bot update translates [13:20:04] <_joe_> hey you all [13:20:04] gehel: no, but feel free to enable them [13:20:08] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on gawiki (duration: 01m 50s) [13:20:11] <_joe_> we're upgrading puppetdb this quarter [13:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:35] _joe_: think that may fix this ? 
[13:20:37] <_joe_> don't bother go in deep debugging mode right now [13:20:40] Zoranzoki21: done [13:20:44] <_joe_> akosiaris: it might, it might not [13:21:04] (03CR) 10Aude: [C: 032] Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:21:07] <_joe_> we're using a 4 years old version of puppetdb, whatever debugging we do now will be useless anyways [13:21:31] PROBLEM - SSH on silver is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:32] _joe_: it happened 9 times during last month [13:21:48] [unrelated] mmh this is the second time today for silver SSH alarm [13:22:39] (03PS3) 10Elukey: wmf_auto_reimage_host: fix check in mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/383351 [13:23:04] <_joe_> volans: I am aware, if you want to waste a week debugging something no one is developing and that we will decom in 2 months tops, be my guest [13:23:26] <_joe_> I think we have many other priorities before that [13:23:43] (03CR) 10BBlack: varnish: add support for version 5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [13:23:52] paladox: thanks for the ping, I'm looking [13:23:53] (03CR) 10Elukey: [C: 032] wmf_auto_reimage_host: fix check in mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/383351 (owner: 10Elukey) [13:24:01] thank you very much [13:24:04] andrewbogott thanks :). [13:24:06] (03CR) 10BBlack: varnish: add support for version 5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [13:24:18] what _joe_says makes sense- what is the way if it happens again- does it autostart, does it need a kick? 
[13:24:22] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [13:24:40] sorry, I was disconnected while the issue appeared [13:24:47] jynus: no, systemd takes care of restarting it, so it's just noise [13:24:49] (03CR) 10Aude: [C: 032] Switch to new wbcheckconstraints API format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383162 (https://phabricator.wikimedia.org/T175590) (owner: 10Lucas Werkmeister (WMDE)) [13:24:55] ok, good to know [13:25:03] _joe_: I was not saying we should, just adding some context [13:25:11] I'm not looking at it, that's for sure ;) [13:25:12] I would just send an email so everbody is aware [13:25:41] (03PS2) 10Aude: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 [13:25:44] 10Operations: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#3671940 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:25:47] (03CR) 10Aude: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:25:52] (03CR) 10Aude: [C: 032] Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:28:21] 10Operations, 10Wikimedia-Logstash: rsyslog on mw1180 seems to not use the logstash LVS endpoint - https://phabricator.wikimedia.org/T177833#3671967 (10Gehel) [13:28:21] (03Merged) 10jenkins-bot: Switch to new wbcheckconstraints API format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383162 (https://phabricator.wikimedia.org/T175590) (owner: 10Lucas Werkmeister (WMDE)) [13:28:31] (03CR) 10jenkins-bot: Switch to new wbcheckconstraints API format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383162 (https://phabricator.wikimedia.org/T175590) (owner: 10Lucas Werkmeister (WMDE)) [13:29:51] (03PS3) 10Aude: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/382384 [13:29:57] (03PS13) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [13:30:11] (03CR) 10Aude: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:30:15] (03CR) 10Aude: [C: 032] Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:30:25] (03CR) 10Ema: varnish: add support for version 5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [13:31:31] PROBLEM - puppet last run on roentgenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[rsyslog] [13:31:41] PROBLEM - SSH on silver is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:06] that's me roentgenium, poking at rsyslog [13:32:33] that is a name easy to remember [13:32:38] gh [13:32:39] (03Merged) 10jenkins-bot: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:32:48] (03CR) 10jenkins-bot: Cleanup old Wikibase echo test wiki configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382384 (owner: 10Aude) [13:33:27] Lucas_WMDE: your change is on mwdebug1002 [13:33:31] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [13:33:33] (i'm also checking) [13:34:01] aude: looks like it’s working, including the gadget [13:34:17] (03CR) 10Volans: [C: 031] "LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382413 (owner: 10Giuseppe Lavagetto) [13:34:32] I can see the new output structure in the network panel and the gadget still shows constraint violations on [[d:Q4115189]] [13:34:41] looks ok (new format) [13:34:50] * aude proceeds [13:34:58] elukey: 
praseodymium ! [13:35:09] on which page on translatewiki to add translate for name of sandbox? [13:36:06] (03PS1) 10Dereckson: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) [13:36:18] Zoranzoki21: I don't understand your question [13:37:21] aude: will be this one ^ [13:37:34] (03CR) 10jerkins-bot: [V: 04-1] Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:37:49] Dereckson: okof [13:37:50] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Remove test wiki echo configs from Wikibase config (duration: 01m 31s) [13:37:50] ok* [13:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:15] aude: seems to be live without debug headers now, thanks! [13:38:24] [15:35] on which page on translatewiki to add translate for name of sandbox? See screenshot for what I think: https://snag.gy/KsBZpN.jpg [13:38:40] Lucas_WMDE: thanks for checking [13:39:13] Dereckson: want me to deploy the throttle config? (or you prefer to do it?) [13:39:31] (03PS2) 10Zoranzoki21: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:40:26] the config looks ok to me [13:40:31] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:40:31] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:40:50] 10Operations, 10Wikimedia-Logstash: rsyslog on mw1180 seems to not use the logstash LVS endpoint - https://phabricator.wikimedia.org/T177833#3672009 (10Gehel) 05Open>03Resolved a:03Gehel Found it! 
Mediawiki seems to talk directly to logstash ([[ https://github.com/wikimedia/operations-mediawiki-config/bl... [13:40:51] aude: was ok [13:40:59] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:40:59] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:40:59] aude: Zoranzoki21 introduced a syntax error [13:41:27] Zoranzoki21: I think that is a gadget and cannot be translated in translatewiki.net. Try ?uselang=qqx [13:41:29] Zoranzoki21: please read http://php.net/manual/en/language.types.array.php [13:41:33] yeah :/ [13:41:44] (03CR) 10jerkins-bot: [V: 04-1] Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:41:48] 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#3672012 (10MoritzMuehlenhoff) >>! In T177371#3657276, @faidon wrote: > We have at least another usage, the Ganeti key (cf. `modules/role/manifests/ganeti.pp`). This was for legacy reasons -- Ganeti didn't suppor... 
[13:41:59] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:41:59] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:10] Zoranzoki21: an array is a datastructure to be able to give more than one inforamtion, for example here several IP addresses [13:42:15] Zoranzoki21: [] is the delimiter of an array [13:42:18] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:19] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:42] (03PS3) 10Zoranzoki21: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:42:45] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#3672014 (10fgiunchedi) syslog-tls is deployed everywhere but esams (coming shortly) Traffic looks good to me, and cpu usage overall is around 15% on wezen/lithium. Connec... 
[13:43:48] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:43:58] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:44:01] (03CR) 10jerkins-bot: [V: 04-1] Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:44:19] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:44:23] (03PS1) 10Gehel: use the logstash LVS endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) [13:44:29] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:44:39] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:45:13] Zoranzoki21: may I fix it please? [13:45:58] Dereckson: please [13:46:29] RECOVERY - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is OK: DNS OK - 0.051 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [13:46:49] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [13:47:09] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3672019 (10jkroll) The wikidiff2 patch introduced new parameters with default values to `wikidiff2_do_diff()` and `w... 
[13:48:14] (03PS4) 10Dereckson: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) [13:50:06] (03CR) 10Aude: [C: 032] Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:50:12] thanks [13:50:25] I've added the change to the deployment table. [13:50:32] ok [13:51:36] (03Merged) 10jenkins-bot: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:51:49] (03CR) 10jenkins-bot: Add XLDB2017 workshop throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383354 (https://phabricator.wikimedia.org/T177835) (owner: 10Dereckson) [13:53:28] out of abundance of caution, it's on mwdebug1002 and looks good [13:53:31] * aude deploying [13:53:34] (03PS1) 10Filippo Giunchedi: Revert "hieradata: enable syslog over tls for eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/383356 [13:54:24] aude: thanks for the deploy [13:54:42] Urbanecm: nice your separation of rules / extension hook code by the way :) [13:55:43] 10Operations, 10HHVM, 10User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#3672036 (10MoritzMuehlenhoff) I investigated the upgrade procedure for "provide icu57 in jessie and migrate before moving to stretch": This allo... 
[13:55:53] (03PS2) 10Filippo Giunchedi: Revert "hieradata: enable syslog over tls for eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/383356 [13:56:21] !log aude@tin Synchronized wmf-config/throttle.php: Add throttle rule T177835 (duration: 02m 16s) [13:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:28] T177835: Throttle rule for 2017-10-12 XLDB event - https://phabricator.wikimedia.org/T177835 [13:56:29] (03PS1) 10KartikMistry: Deploy Compact Language Links on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383357 (https://phabricator.wikimedia.org/T177836) [13:56:33] done [13:56:35] (03CR) 10Filippo Giunchedi: [C: 032] Revert "hieradata: enable syslog over tls for eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/383356 (owner: 10Filippo Giunchedi) [13:57:58] PROBLEM - SSH on silver is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:19] (03CR) 10KartikMistry: [C: 04-1] "Until, https://gerrit.wikimedia.org/r/#/c/364428/10 is deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383357 (https://phabricator.wikimedia.org/T177836) (owner: 10KartikMistry) [14:01:25] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [14:01:46] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [14:01:46] RECOVERY - Check for gridmaster host resolution TCP on labs-ns1.wikimedia.org is OK: DNS OK - 0.008 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [14:03:08] 10Operations, 10Cloud-Services: Wikimedia Cloud (labs) dns is intermittingly failing - https://phabricator.wikimedia.org/T177834#3672058 (10Andrew) 05Open>03Resolved a:03Andrew This seems to have been caused by https://gerrit.wikimedia.org/r/#/c/382415/, which has now been reverted. The labservices boxe... 
[14:07:15] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3672066 (10MoritzMuehlenhoff) >>! In T176637#3672019, @jkroll wrote: > It should be restarted after a native PHP ext... [14:08:54] (03PS2) 10Rush: labs: do not replicate wb_entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/382694 (https://phabricator.wikimedia.org/T95685) (owner: 10Ladsgroup) [14:14:14] is swat done? [14:14:38] marostegui: yes [14:14:42] \o/ [14:14:43] thanks [14:15:04] (03PS3) 10Marostegui: db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 [14:16:15] aude: on tin: Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded. [14:17:16] (03PS2) 10EddieGP: [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [14:17:45] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [14:19:52] 10Operations, 10Puppet, 10cloud-services-team, 10User-Joe: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3672110 (10herron) [14:21:02] (03PS3) 10EddieGP: [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [14:21:31] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [14:22:03] marostegui: hmm [14:22:12] checking [14:22:23] thanks [14:22:26] !log add druid public cluster's IPs to analytics-in4 on cr1/cr2 - T177511 [14:22:28] (on mediawiki-staging) 
[14:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:31] T177511: LVS for Druid - https://phabricator.wikimedia.org/T177511 [14:22:40] think i need to sync again [14:23:13] doing [14:23:23] think i forgot rebase after pulling on mwdebug [14:24:21] !log aude@tin Synchronized wmf-config/throttle.php: Add throttle rule T177835 (duration: 00m 47s) [14:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:28] T177835: Throttle rule for 2017-10-12 XLDB event - https://phabricator.wikimedia.org/T177835 [14:24:50] aude: looks good now :) [14:24:58] yeah [14:25:01] thanks for poking me [14:25:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 (owner: 10Marostegui) [14:25:10] (03CR) 10Volans: [C: 031] "Code seems ok, any easy way to test it?" (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382649 (owner: 10Giuseppe Lavagetto) [14:25:19] aude: yw! thanks for fixing it [14:25:52] (03PS4) 10EddieGP: [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [14:25:56] sure [14:26:05] (03CR) 10EddieGP: "I went for "twice a month" now (as bi-weekly is hard to express in a cron). 
I've grepped for existing "monthday" params in modules/mediawi" [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [14:26:20] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [14:26:20] * aude back later [14:26:22] (03PS2) 10Filippo Giunchedi: prometheus: add conntrack/entropy/edac collectors [puppet] - 10https://gerrit.wikimedia.org/r/382695 (https://phabricator.wikimedia.org/T177196) [14:26:53] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3672117 (10jkroll) >>! In T176637#3672066, @MoritzMuehlenhoff wrote: >>>! In T176637#3672019, @jkroll wrote: >> It s... [14:28:03] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 (owner: 10Marostegui) [14:28:13] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1072 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383327 (owner: 10Marostegui) [14:28:40] 10Operations, 10Cloud-Services: Wikimedia Cloud (labs) dns is intermittingly failing - https://phabricator.wikimedia.org/T177834#3672119 (10Paladox) thanks. 
[14:28:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383365 [14:28:46] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383365 [14:29:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 with low weight - T172679 (duration: 00m 47s) [14:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:21] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [14:31:38] !log Optimize table ores_classifications on db1080 - T159753 [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:44] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [14:32:34] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383365 (owner: 10Marostegui) [14:33:21] (03PS5) 10EddieGP: [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) [14:34:04] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383365 (owner: 10Marostegui) [14:35:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 - T174509 (duration: 00m 47s) [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:35:54] (03CR) 10EddieGP: "Also won the fight with code style now, sorry for the spam ;)" [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [14:35:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/383365 (owner: 10Marostegui) [14:40:18] 10Operations, 10Puppet, 10User-Joe: Set up puppet catalog diff on host with access to puppetmaster1001 and puppetmaster2001 - https://phabricator.wikimedia.org/T177843#3672181 (10herron) [14:40:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383367 [14:40:27] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383367 [14:41:42] 10Operations, 10Puppet, 10cloud-services-team, 10User-Joe: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [14:42:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383367 (owner: 10Marostegui) [14:44:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383367 (owner: 10Marostegui) [14:45:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T174509 (duration: 00m 47s) [14:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:27] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:46:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383367 (owner: 10Marostegui) [14:46:46] PROBLEM - HP RAID on db2038 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Predictive Failure: 1I:1:8 - Controller: OK - Battery/Capacitor: OK [14:46:48] ACKNOWLEDGEMENT - HP RAID on db2038 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Predictive Failure: 1I:1:8 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID 
handler auto-ack: https://phabricator.wikimedia.org/T177844 [14:46:51] 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10ops-monitoring-bot) [14:46:56] And it failed :( [14:47:22] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672205 (10Marostegui) [14:48:05] both or only one? [14:48:20] only one for now it seems [14:48:32] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672207 (10Marostegui) And one of them failed already: T177844 ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS,... [14:48:47] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui) This is being handled at: T177720 [14:49:00] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672215 (10Marostegui) [14:49:02] merge it [14:49:04] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) [14:49:45] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) 05duplicate>03Open [14:50:01] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) [14:50:03] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui) [14:51:43] (03PS2) 10Muehlenhoff: Provide a reboot wrapper for Cumin clients [puppet] - 10https://gerrit.wikimedia.org/r/383316 [14:52:06] PROBLEM - DPKG on ms-be2013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:52:45] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - 
https://phabricator.wikimedia.org/T177720#3672225 (10Marostegui) @Papaul let us know if you were able to find disks to replace the (now) broken one and the one that will soon fail. Thanks! [14:53:06] RECOVERY - DPKG on ms-be2013 is OK: All packages OK [14:53:18] !log test stretch 9.2 upgrade on ms-be2013 - T177739 [14:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:24] T177739: Integrate stretch 9.2 point release - https://phabricator.wikimedia.org/T177739 [14:55:22] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672230 (10Marostegui) @Papaul db2010 which is scheduled for decommissioning (T175685) has the same chassis, so maybe it also has the same disks? [14:55:47] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672232 (10Papaul) @Marostegui I have some 600GB 15k that I can pull out off db2025. Just keep in mind that those are Dell disks [14:56:17] (03CR) 10Thcipriani: [C: 04-1] "In the current state this would make scap's MediaWiki canary checks fail." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:56:46] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3672233 (10Cmjohnson) Ticket submitted with Dell You have successfully submitted request SR955002039. [14:57:03] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672234 (10Marostegui) If db2025 is decommissioned, I would say let's go ahead... [14:57:19] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672236 (10Cmjohnson) The part is backordered, I will update ticket as soon I see it's shipped. 
[14:57:36] PROBLEM - DPKG on ms-fe2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:57:51] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672242 (10Marostegui) Thank you! [14:59:55] !log lvs1007 swapping NIC card [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:02] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672250 (10Marostegui) Btw, let's change just one disk at the time. [15:00:36] RECOVERY - DPKG on ms-fe2005 is OK: All packages OK [15:00:43] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672264 (10Papaul) ok I i will replaced first slot 1 [15:00:46] !log test stretch 9.2 upgrade on ms-fe2005 - T177739 [15:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] T177739: Integrate stretch 9.2 point release - https://phabricator.wikimedia.org/T177739 [15:01:17] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672266 (10Marostegui) Sounds good - thank you [15:01:36] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:46] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:01:46] PROBLEM - Swift HTTP frontend on ms-fe2005 is CRITICAL: connect to address 10.192.0.28 and port 80: Connection refused [15:01:46] PROBLEM - Swift HTTPS on ms-fe2005 is CRITICAL: connect to address 10.192.0.28 and port 80: Connection refused [15:02:16] PROBLEM - Swift HTTP backend on ms-fe2005 is CRITICAL: connect to address 10.192.0.28 and port 80: Connection refused [15:03:16] RECOVERY - Swift HTTP backend on ms-fe2005 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.090 second response time [15:03:46] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational [15:03:46] RECOVERY - Swift HTTP frontend on ms-fe2005 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.074 second response time [15:03:46] RECOVERY - Swift HTTPS on ms-fe2005 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.074 second response time [15:04:35] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672273 (10Papaul) Complete. Let me know when ready for slot 7 [15:05:24] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672276 (10Marostegui) Thanks, RAID rebuilding now: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)... 
[15:07:11] (03CR) 10Ema: [C: 032] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [15:07:36] (03PS14) 10Ema: varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) [15:07:38] (03CR) 10Ema: [V: 032 C: 032] varnish: add support for version 5 [puppet] - 10https://gerrit.wikimedia.org/r/382464 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [15:10:52] (03PS5) 10Thcipriani: Deployment pipeline profile [puppet] - 10https://gerrit.wikimedia.org/r/382608 (https://phabricator.wikimedia.org/T173128) [15:10:52] PROBLEM - Host lvs1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:13] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3672320 (10Nuria) [15:13:22] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89899.28 seconds [15:16:35] 10Operations, 10Research, 10Research-2017-18-Q2: Permissions to upload data to the analytics cluster from a machine at Drexel - https://phabricator.wikimedia.org/T177521#3672330 (10Halfak) [15:16:44] ACKNOWLEDGEMENT - HP RAID on db2038 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Predictive Failure: 1I:1:8 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177848 [15:16:47] 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177848#3672331 (10ops-monitoring-bot) [15:17:21] 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177848#3672338 (10Marostegui) 05Open>03declined We are handling it on: T177720 [15:18:03] 10Operations, 10ops-codfw, 
10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) @Papaul the rebuild for that disk has failed - can we try another spare disk maybe? [15:18:56] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672357 (10Papaul) ok [15:20:53] (03CR) 10Gehel: "If I understand correctly, this is about querying logstash, not sending logs to logstash. And this is accessing elasticsearch directly (on" [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [15:21:11] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:21:21] RECOVERY - Host lvs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [15:21:47] (03PS3) 10Gehel: logstash: all log producers need to use the logstash LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) [15:21:56] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672364 (10Papaul) done [15:22:39] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672365 (10Marostegui) Here we go again: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physica... [15:23:33] !log cp1008 upgraded to varnish 5 T168529 [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:40] T168529: Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529 [15:24:21] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[txqueuelen-eth3],Exec[txqueuelen-eth2] [15:26:54] (03CR) 10Filippo Giunchedi: "> > For more naming bikeshedding: since we're going to fold all" [puppet] - 10https://gerrit.wikimedia.org/r/382506 (https://phabricator.wikimedia.org/T177501) (owner: 10Eevans) [15:27:49] (03CR) 10Filippo Giunchedi: "Prometheus-equivalent dashboard https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:27:58] !log upgrading neodymium and sarin to mariadb-client 10.1.28 [15:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:26] (03CR) 10Thcipriani: [C: 031] "Works for me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [15:33:16] thcipriani: as I understand, the change above will only have impact on the next scap deployment? Anything you would like to check after I merge it? [15:33:27] (03PS4) 10Gehel: logstash: all log producers need to use the logstash LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) [15:34:55] gehel: sure, I can run a quick no-op deploy to ensure that everything is still working for scap as it should. [15:35:06] ok, I'll deploy right now... 
[15:35:19] (03CR) 10Gehel: [C: 032] logstash: all log producers need to use the logstash LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/383097 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [15:35:38] ok, let me know once puppet runs on tin and I'll give it a try [15:37:36] thcipriani: puppet run completed on tin [15:37:43] ok testing [15:39:34] !log thcipriani@tin Synchronized README: noop deployment to test new logstash-checker.py host (duration: 00m 47s) [15:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:44] gehel: works great :) [15:40:01] thcipriani: Thanks! We are one step closer to getting rid of logstash1001! [15:40:16] thanks for looking out for our tiny use-case, appreciated :) [15:40:38] and congrats [15:40:42] and thanks for reviewing it in time! [15:48:05] !log starting purge of commonswiki.recentchanges T177772 [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:11] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772 [15:48:59] (03PS1) 10Ema: varnish reload-vcl: avoid using columns in vcl_label [puppet] - 10https://gerrit.wikimedia.org/r/383369 (https://phabricator.wikimedia.org/T168529) [15:50:30] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672446 (10Papaul) [15:53:59] (03Abandoned) 10Filippo Giunchedi: rt-hacks: add maint-announce_add_to_gcal.js [software] - 10https://gerrit.wikimedia.org/r/353087 (owner: 10Filippo Giunchedi) [15:55:05] 10Operations, 10Traffic, 10Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3672450 (10BBlack) 05Open>03Resolved a:03BBlack The cache admission policy change seems to have gotten us over this for now. 
We should probably wait for t... [15:55:35] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672457 (10Papaul) a:05Papaul>03RobH [15:55:44] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3672458 (10Cmjohnson) Swapped the NIC card with a new one that HP sent. [15:56:15] 10Operations, 10Traffic, 10Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3672461 (10BBlack) [15:56:18] 10Operations, 10Traffic, 10Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3672459 (10BBlack) 05Open>03Resolved a:03BBlack [15:56:43] 10Operations, 10Traffic, 10Patch-For-Review: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3672462 (10BBlack) 05Open>03Resolved a:03BBlack ^ The above seems to have resolved the esams-specific 503s. Closing this up! [16:00:05] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:21] * elukey waits godog's gif [16:02:01] elukey: https://i.imgur.com/yjnLHZM.mp4 [16:03:21] hahaah [16:03:28] I am going to create lag on codfw-s4 [16:03:39] I have downtime'd all alerts [16:03:40] !log milimetric@tin Started deploy [analytics/refinery@f4a0a33]: Deploying mostly for the new script in the bin folder [16:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:49] but so that you know from my previous log [16:05:29] I have also added a note on https://wikitech.wikimedia.org/wiki/Deployments [16:14:58] 10Operations, 10Wikimedia-Fundraising-CiviCRM, 10fundraising-tech-ops: mintaka disk space warning - https://phabricator.wikimedia.org/T177852#3672488 (10herron) [16:17:58] 10Operations, 10Wikimedia-Fundraising-Banners, 10fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3672524 (10herron) [16:19:41] !log milimetric@tin Finished deploy [analytics/refinery@f4a0a33]: Deploying mostly for the new script in the bin folder (duration: 16m 01s) [16:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:17] 10Operations, 10Pybal, 10Traffic, 10netops, 10Patch-For-Review: Deploy pybal with BGP MED support (for primary/backup) in production - https://phabricator.wikimedia.org/T165584#3672568 (10ayounsi) We need to cleanup this specific term, now that the LVS advertise the MED themselves. ```delete policy-optio... [16:34:18] (03PS1) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/383375 (https://phabricator.wikimedia.org/T158583) [16:34:51] 10Operations, 10Wikimedia-Fundraising-Banners, 10fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3672524 (10cwdent) @herron Thanks for the heads up! Just curious, how did you notice that? Thought Jeff and I were the only ones to get alerted about frack stuff any... 
[16:34:54] (03CR) 10jerkins-bot: [V: 04-1] Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/383375 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [16:36:10] (03PS3) 10Gehel: [WIP] maps: move to vector tiles and cleartables [puppet] - 10https://gerrit.wikimedia.org/r/378245 (https://phabricator.wikimedia.org/T157613) [16:36:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP] maps: move to vector tiles and cleartables [puppet] - 10https://gerrit.wikimedia.org/r/378245 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [16:36:41] RECOVERY - puppet last run on roentgenium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:38:31] (03PS1) 10Ladsgroup: labs: Add $wgWBRepoSettings['canonicalUriProperty'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383376 (https://phabricator.wikimedia.org/T177857) [16:38:53] can I something to the puppet swat now? [16:39:10] godog, moritzm, _joe_ ? [16:39:22] (03PS2) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/383375 (https://phabricator.wikimedia.org/T158583) [16:39:38] (03PS1) 10Jdlrobson: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) [16:40:52] 10Operations, 10Wikimedia-Fundraising-Banners, 10fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3672670 (10herron) Hey @cwdent, I just noticed the disk alert listed in the icinga "unhandled services" board and created this to acknowledge it. So the icinga number... 
[16:41:01] FYI: It's https://gerrit.wikimedia.org/r/#/c/383334/ [16:47:24] db1064 seems now lagging a bit [16:48:33] seems the replication control kicked in [16:49:25] 10Operations, 10Wikimedia-Fundraising-Banners, 10fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3672706 (10cwdent) @herron ah yes of course, pardon my brain fart :P We are aggregating kafka data on that disk right now so it is extra full, but will be freeing it... [16:50:29] it is not pooled for main traffic, + long running querys running [16:50:36] I will downtime it and ignore it [16:53:25] 10Operations, 10Wikimedia-Fundraising-Banners, 10fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3672714 (10herron) Sounds good! [16:55:04] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: move frav1001's to the frack-fundraising VLAN so we can use it for database testing - https://phabricator.wikimedia.org/T176492#3672724 (10Cmjohnson) [16:55:07] 10Operations, 10ops-eqiad: relabel WMF3083 as frdb1003 - https://phabricator.wikimedia.org/T176507#3672722 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T1700). [17:00:06] No GERRIT patches in the queue for this window AFAICS. [17:00:21] no parsoid deploy today [17:00:23] No ORES today :( Soon though. SOON [17:00:56] (03PS1) 10BBlack: Depool esams for expected blips during ASN renumbering [dns] - 10https://gerrit.wikimedia.org/r/383382 (https://phabricator.wikimedia.org/T167840) [17:01:58] anyone that will be around later deployments? 
kibana is full of db1064 replication lag errors - that is normal, there is maintenance ongoing [17:02:20] see SAL, but mostly ignore those, unless a page/error shows here [17:02:44] please transmit to late (for me) deployers [17:02:47] so they don't worry [17:15:28] (03CR) 10Herron: [C: 032] admin: new ssh key for Ladsgroup [puppet] - 10https://gerrit.wikimedia.org/r/383334 (owner: 10Ladsgroup) [17:15:36] (03PS2) 10Herron: admin: new ssh key for Ladsgroup [puppet] - 10https://gerrit.wikimedia.org/r/383334 (owner: 10Ladsgroup) [17:20:09] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672822 (10Marostegui) @Papaul the disk went fine, can you change the other one pending now? ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1,... [17:25:21] !log krinkle@tin Synchronized php-1.31.0-wmf.2/resources/lib/jquery.ui/: I717f2580e3aae (duration: 00m 47s) [17:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:22] !log shutting down and restarting elasticsearch on relforge1001 for testing - T170378 [17:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:29] T170378: Investigate why elastic@codfw alerted during codfw row B switch upgrade - https://phabricator.wikimedia.org/T170378 [17:26:43] ACKNOWLEDGEMENT - HP RAID on db2038 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:8 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177862 [17:26:47] 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177862#3672853 (10ops-monitoring-bot) [17:27:16] (03PS1) 10Cmjohnson: Adding mgmt dns entries for db1107/1108 T177405 [dns] - 10https://gerrit.wikimedia.org/r/383385 [17:27:17] 10Operations, 10ops-codfw: Degraded RAID on db2038
- https://phabricator.wikimedia.org/T177862#3672860 (10Marostegui) 05Open>03declined Being handled at: T177720 [17:27:36] I’m getting consistent failures trying to clone mediawiki-config, from two different environments. [17:27:47] error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated. [17:27:52] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for db1107/1108 T177405 [dns] - 10https://gerrit.wikimedia.org/r/383385 (owner: 10Cmjohnson) [17:28:01] error: RPC failed; curl 56 SSLRead() return error -980600 KiB/s [17:28:07] 10Operations, 10ops-codfw, 10DBA: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) I can see it is rebuilding now - thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 6... [17:31:16] (03CR) 10Ayounsi: [C: 032] Depool esams for expected blips during ASN renumbering [dns] - 10https://gerrit.wikimedia.org/r/383382 (https://phabricator.wikimedia.org/T167840) (owner: 10BBlack) [17:31:17] (03PS2) 10Ayounsi: Depool esams for expected blips during ASN renumbering [dns] - 10https://gerrit.wikimedia.org/r/383382 (https://phabricator.wikimedia.org/T167840) (owner: 10BBlack) [17:32:37] !log depooling esams from DNS for T167840 [17:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:44] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [17:41:13] (03CR) 10BryanDavis: [C: 031] "Looks ok to me. Giving nuke to contentadmin especially seems like it will be helpful for our users." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) (owner: 10EddieGP) [17:51:41] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3672942 (10Dzahn) >>! 
In T177225#3669570, @akosiaris wrote: > it got me thinking how are we going to clean up the fleet ? Should we use puppet (so all these changes require amending)... [17:54:38] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3672952 (10Dzahn) >>! In T177225#3670115, @faidon wrote: > equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken... [17:55:00] (03PS1) 10Dduvall: Dummy secret config.json for docker-pusher script [labs/private] - 10https://gerrit.wikimedia.org/r/383386 [17:58:19] (03CR) 10Dduvall: Dummy secret config.json for docker-pusher script (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/383386 (owner: 10Dduvall) [17:59:36] (03PS1) 10Nuria: Do not store PopUps events on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/383389 (https://phabricator.wikimedia.org/T176469) [17:59:51] (03CR) 10jerkins-bot: [V: 04-1] Do not store PopUps events on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/383389 (https://phabricator.wikimedia.org/T176469) (owner: 10Nuria) [18:00:11] (03PS1) 10Jdlrobson: Enable Vector print logo on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) [18:01:17] (03CR) 10Dzahn: "Gehel, thanks! just to confirm, do we have equivalent / prometheus/grafana monitoring for this before we remove it?" [puppet] - 10https://gerrit.wikimedia.org/r/382927 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:02:48] (03CR) 10Gehel: [C: 031] "At least, we have everything we need in graphite. I have not looked at Ganglia in ages, and I have not missed anything." 
[puppet] - 10https://gerrit.wikimedia.org/r/382927 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:03:40] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3672980 (10Cmjohnson) [18:07:26] (03PS6) 10Thcipriani: Deployment pipeline profile [puppet] - 10https://gerrit.wikimedia.org/r/382608 (https://phabricator.wikimedia.org/T173128) [18:08:34] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3673007 (10Legoktm) >>! In T177276#3671190, @Joe wrote: > * There is no need for cache busters as we ignore cache at image build time. T... [18:11:30] (03PS2) 10Dzahn: elasticsearch/logstash: drop ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382927 (https://phabricator.wikimedia.org/T177225) [18:13:50] (03CR) 10Dzahn: [C: 032] elasticsearch/logstash: drop ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382927 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:17:22] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/ganglia/python_modules/elasticsearch_monitoring.py] [18:18:06] well, that would be me i guess. on it [18:18:25] only expected logstash1001-1004 to be affected [18:19:11] but they arent, while 1009 doesnt even the role i touched and has no issue now.. [18:21:19] (03CR) 10Mforns: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/383389 (https://phabricator.wikimedia.org/T176469) (owner: 10Nuria) [18:22:22] RECOVERY - puppet last run on logstash1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:22:42] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:22:42] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:11] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:11] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:14] mutante: the usual transient failure when you remove a puppet://... file... we should use templates everywhere... [18:23:31] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:31] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:32] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:32] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds [18:23:44] gehel: yes, well, there is a part about it like. it removes it without me having to use cumin, which i was planning to :) [18:24:11] checking though for remnants [18:24:12] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:24:44] also runs puppet on stat1006 but nothing related here [18:25:11] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational [18:25:11] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up [18:25:32] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient [18:25:32] RECOVERY - Disk space on stat1006 is OK: DISK OK [18:25:32] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [18:25:32] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set [18:25:42] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [18:25:42] RECOVERY - DPKG on stat1006 is OK: All packages OK [18:25:43] shrug [18:27:04] Hi [18:27:15] sup [18:27:21] Wondering who is doing the train deployment today? [18:27:21] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:27:36] jouncebot: next [18:27:36] In 0 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T1900) [18:27:48] I created wmf.3 branch of wikidata build but forgot to update release tools [18:28:00] audephone: it says Chad does it [18:28:04] Ok [18:28:43] Either someone can do that for me else I can bump wikidata later and run scap (after evening swat when I'm home) [18:29:00] 10Operations, 10netops, 10Patch-For-Review: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3673052 (10ayounsi) After depooling esams the Telia link in eqiad started to saturate, I added the following terms to temporary ease out that link. ``` [edit policy-options as-path-group AVOID-P... 
[18:29:20] Either way not a big deal to me [18:29:27] no_justification: ^ [18:41:19] !log logstash*: delete /usr/lib/ganglia/python_modules/elasticsearch_monitoring.py and /etc/ganglia/conf.d/elasticsearch.pyconf via cumin (T177225) [18:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:27] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [18:41:36] (03CR) 10Dzahn: "14:41 < mutante> !log logstash*: delete /usr/lib/ganglia/python_modules/elasticsearch_monitoring.py and /etc/ganglia/conf.d/elasticsearch." [puppet] - 10https://gerrit.wikimedia.org/r/382927 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:47:09] (03PS2) 10Dzahn: cumin: drop ganglia::web role alias [puppet] - 10https://gerrit.wikimedia.org/r/382920 (https://phabricator.wikimedia.org/T177225) [18:48:55] (03CR) 10Dzahn: [C: 032] cumin: drop ganglia::web role alias [puppet] - 10https://gerrit.wikimedia.org/r/382920 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:52:14] (03PS1) 10Krinkle: noc: Remove unused bugzilla-ish global.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383394 [18:52:16] (03PS1) 10Krinkle: noc: Remove bugzilla-ish css/index.css and images/index/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383395 [18:52:22] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:53:13] no_justification: ok to roll out a quick noc cleanup? 
[18:53:41] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3673222 (10ovasileva) [18:54:12] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:55:13] (03PS2) 10Dzahn: zookeeper: update comment about ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/382921 (https://phabricator.wikimedia.org/T177225) [18:56:21] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] [18:57:21] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] [18:57:32] (03CR) 10Dzahn: "thanks Elukey, that's what i wanted to find out. Also after seeing Ottomata's comment on T177225#3669650. I simply amended it to update t" [puppet] - 10https://gerrit.wikimedia.org/r/382921 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:57:41] (03PS3) 10Dzahn: zookeeper: update comment about ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/382921 (https://phabricator.wikimedia.org/T177225) [18:59:09] (03CR) 10Dzahn: [C: 032] "just a comment now" [puppet] - 10https://gerrit.wikimedia.org/r/382921 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:00:04] no_justification: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:00:28] !log starting the work for T167840 [19:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:35] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [19:00:57] (03PS1) 10Krinkle: noc: Remove unused Vector CSS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383396 [19:01:31] (03PS11) 10Dzahn: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [19:01:58] (03CR) 10Krinkle: [C: 032] noc: Remove unused bugzilla-ish global.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383394 (owner: 10Krinkle) [19:01:59] (03CR) 10Krinkle: [C: 032] noc: Remove bugzilla-ish css/index.css and images/index/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383395 (owner: 10Krinkle) [19:02:01] (03CR) 10Krinkle: [C: 032] noc: Remove unused Vector CSS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383396 (owner: 10Krinkle) [19:03:24] (03Merged) 10jenkins-bot: noc: Remove unused bugzilla-ish global.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383394 (owner: 10Krinkle) [19:03:29] (03CR) 10jerkins-bot: [V: 04-1] noc: Remove unused Vector CSS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383396 (owner: 10Krinkle) [19:03:37] (03CR) 10jenkins-bot: noc: Remove unused bugzilla-ish global.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383394 (owner: 10Krinkle) [19:04:06] (03PS2) 10Krinkle: noc: Remove bugzilla-ish css/index.css and images/index/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383395 [19:04:10] (03PS2) 10Krinkle: noc: Remove unused Vector CSS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383396 [19:04:12] @seen hashar [19:04:12] mutante: Last time I saw hashar they were quitting the network with reason: Read error: Connection reset by peer N/A at 10/9/2017 8:43:21 PM (22h20m51s ago) [19:06:48] (03CR) 10jenkins-bot: noc: Remove bugzilla-ish 
css/index.css and images/index/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383395 (owner: 10Krinkle) [19:07:23] (03CR) 10Dzahn: [C: 032] "per "Applied on beta cluster puppetmaster for >1 year"" [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [19:07:41] !log krinkle@tin Synchronized docroot/noc/: Clean up noc/index CSS (duration: 00m 47s) [19:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:15] (03CR) 10jenkins-bot: noc: Remove unused Vector CSS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383396 (owner: 10Krinkle) [19:10:10] (03CR) 10Dzahn: "the parent patch of this has now been merged on prod puppetmaster, so that doesn't need to be cherry-picked anymore and the steps describe" [puppet] - 10https://gerrit.wikimedia.org/r/330412 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [19:10:55] (03CR) 10Dzahn: "now the follow-up https://gerrit.wikimedia.org/r/#/c/330412/4 can be done" [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [19:16:21] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89890.87 seconds [19:24:18] (03PS1) 10Gehel: maps: isolate maps-test2004 to test vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/383398 (https://phabricator.wikimedia.org/T153282) [19:24:49] (03CR) 10jerkins-bot: [V: 04-1] maps: isolate maps-test2004 to test vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/383398 (https://phabricator.wikimedia.org/T153282) (owner: 10Gehel) [19:25:32] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:26:02] PROBLEM - Host labweb1002 is DOWN: CRITICAL - Host Unreachable (208.80.155.109) [19:26:21] PROBLEM - Host labweb1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:49] (03PS1) 10Krinkle: noc: Make css/base.css shared between index, db and conf views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383399 [19:28:08] (03CR) 10Krinkle: [C: 032] "Compared rendering before/after of each of the three end points. Same, except for minor line-height differences which seems worth unifying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383399 (owner: 10Krinkle) [19:29:27] (03Merged) 10jenkins-bot: noc: Make css/base.css shared between index, db and conf views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383399 (owner: 10Krinkle) [19:29:39] (03CR) 10jenkins-bot: noc: Make css/base.css shared between index, db and conf views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383399 (owner: 10Krinkle) [19:30:27] !log krinkle@tin Synchronized docroot/noc/: Clean up noc base.css (duration: 00m 48s) [19:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:12] (03PS1) 10Herron: Add puppetcompiler1001 forward and reverse DNS records [dns] - 10https://gerrit.wikimedia.org/r/383400 (https://phabricator.wikimedia.org/T177843) [19:34:44] oh, we are getting puppetcompiler dedicated host in prod? 
nice [19:38:48] (03PS1) 10Merlijn van Deen: Extend webservice -h details [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/383401 [19:43:41] PROBLEM - Host cp3005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:41] PROBLEM - Host cp3030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:42] PROBLEM - Host cp3037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:42] PROBLEM - Host cp3006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:43] PROBLEM - Host cp3042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:43] PROBLEM - Host cp3010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:43:44] PROBLEM - Host cp3043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:17] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:44:17] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [19:44:37] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:44:43] <_joe_> wat [19:44:46] PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host maerlant.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host cp3032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host cp3038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host lvs3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host cp3036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:44:51] PROBLEM - Host lvs3003.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [19:44:51] expectec [19:44:52] XioNoX: is that you? [19:44:53] esams is depooled, XioNoX is working on it [19:44:54] *expected [19:44:56] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2456 bytes in 0.342 second response time [19:45:02] ok thanks bblack [19:45:07] RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 44%, RTA = 83.90 ms [19:45:26] RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.87 ms [19:45:27] <_joe_> downtiming would've been better :) [19:45:41] we've had that debate before lol [19:45:51] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [19:45:51] hey [19:45:57] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms [19:45:57] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.83 ms [19:46:03] I downtimed some hosts, but not all of them [19:46:17] mostly the network stuff actually [19:46:19] glad that all is ok :) [19:46:21] PROBLEM - Host re0.cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [19:46:21] PROBLEM - Host re0.cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [19:46:31] PROBLEM - Host ms-be3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:31] PROBLEM - Host ms-be3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:31] PROBLEM - Host ms-fe3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:31] PROBLEM - Host ms-be3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:31] PROBLEM - Host ms-be3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:32] PROBLEM - Host multatuli.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:32] PROBLEM - Host ms-fe3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:35] yeah the host alerts are all .mgmt. 
[19:47:11] PROBLEM - Host cp3033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:11] PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:11] PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:11] PROBLEM - Host nescio.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:11] PROBLEM - Host cp3041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:11] PROBLEM - Host cp3049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:20] who cares about the hosts, it's the LVS services that page ... [19:47:21] PROBLEM - Host bast3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:47:26] is it just me or are our sites having trouble? i'm having like 50% packed lott to gerrit.wikimedia.org [19:47:33] packet loss* [19:47:38] MatmaRex: from where? [19:47:43] ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177875 [19:47:44] Poland [19:47:46] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T177875#3673505 (10ops-monitoring-bot) [19:47:55] MatmaRex: what IP is gerrit.wikimedia.org resolving to for you? [19:48:14] bblack: 208.80.154.85 [19:48:17] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2421 bytes in 0.341 second response time [19:48:18] someone else reported issues with cloning from that url. [19:48:27] oh actually, nevermind that, gerrit isn't through misc-web heh [19:48:28] gerrit isn't behind misc-lb, I think? [19:48:37] MatmaRex: could you provide a traceroute? [19:48:46] it's possible eqiad is congested from all the extra traffic [19:48:48] surely we don't pick up eqiad traffic for IPs like gerrit's up in esams? 
[19:49:06] one port on cr2-eqiad saturated [19:49:08] we don't [19:49:08] sure, doing [19:49:11] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:21] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:31] mark: which one? [19:49:31] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:49:31] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:32] xe-4/3/1 [19:49:32] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:32] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:39] (03PS11) 10Amire80: Remove compact language links dblist for simplicity 
(no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 [19:49:41] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:41] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 91 connecting: cp3007_v4, cp3008_v4, cp3010_v4, cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:42] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:51] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:49:51] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:49:51] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:49:51] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:49:52] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:01] PROBLEM - IPsec on cp2020 is CRITICAL: 
Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:01] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:01] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:02] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:02] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 11 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:02] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:11] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:11] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:11] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:12] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:12] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 11 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:12] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, 
cp3042_v4, cp3043_v4 [19:50:12] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:14] mark: I routed some traffic away from that link so it doesn't saturate. It's close to max bw though: https://librenms.wikimedia.org/graphs/id=6841/type=port_bits/from=1507643100/to=1507664700/ [19:50:21] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:21] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 11 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:21] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:21] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:22] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:22] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:22] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:23] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:28] blergh 
[19:50:31] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:31] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:31] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:31] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:32] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:32] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:32] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:32] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:40] yay ipsec [19:50:41] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:41] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:41] PROBLEM - IPsec on cp2002 is CRITICAL: 
Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:42] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp3030_v4, cp3031_v4, cp3032_v4, cp3033_v4, cp3040_v4, cp3041_v4, cp3042_v4, cp3043_v4 [19:50:42] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:42] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 11 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:50:42] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp3034_v4, cp3035_v4, cp3036_v4, cp3037_v4, cp3038_v4, cp3039_v4, cp3044_v4, cp3045_v4, cp3046_v4, cp3047_v4, cp3048_v4, cp3049_v4 [19:50:43] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3007_v4, cp3008_v4, cp3010_v4 [19:51:27] XioNoX: need any help? [19:51:41] paravoid: bblack: traceroute here https://pastebin.com/14hvBYF5 [19:52:21] yeah so MatmaRex is coming over telia in eqiad in the traceroute it looks like [19:52:24] which is that link [19:52:26] yes [19:52:48] MatmaRex: and what is your IP? [19:53:06] (down to /24 or in private is fine as well) [19:53:07] btw, aside from the mgmts and the LVS services, hosts and services alerts could have been avoided by echo "profile::base::notifications_enabled: '0'" in hieradata/esams.yaml [19:53:20] paravoid: 185.157.12.102 [19:53:29] yeah, back through telia as well [19:53:31] echo "profile::base::notifications_enabled: '0'" > hieradata/esams.yaml more correctly [19:53:39] just as a note for next time [19:53:39] MatmaRex: are you still seeing issues? 
[19:54:09] please !log all actions [19:54:31] paravoid: yes [19:55:36] hrm [19:56:11] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:56:50] traceroute shows issues in hop 10 or thereabouts, in cogent's network [19:56:58] could be a red herring of course and just ICMP rate limiting [19:57:10] but 0 packet loss up until that hop, so.. [19:57:42] the telia interface doesn't seem congested even in by-second internals ("monitor interface ...") [19:58:11] so esams advertises its routes to eqiad, but doesn't learn eqiad's (and the rest of the infra) routes [19:58:27] ospf? [19:58:36] (03PS1) 10BBlack: geodns: US+CA prefer codfw over eqiad [dns] - 10https://gerrit.wikimedia.org/r/383402 [19:58:37] ibgp? [19:58:49] ^ an option for reducing saturation in eqiad: move all the US customers over to codfw [19:59:04] (leaving eqiad mostly for the esams traffic) [19:59:04] ebgp between eqiad and esams [19:59:16] investigating [19:59:18] confed ebgp? which routes? [19:59:26] what have you done so far? :) [20:00:12] everything before "Initiate AS# change on AMS-IX portal" on https://phabricator.wikimedia.org/T167840 [20:00:17] (03CR) 10Bmansurov: [C: 031] Enable Vector print logo on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) (owner: 10Jdlrobson) [20:02:07] actually, only 208.80.152.0/22 is being installed on the ams side, not 198.73.209.0/24 [20:02:56] paravoid: bblack: hmm, it seems to have improved just now [20:03:04] something you did or magic? [20:03:06] I haven't touched anything, only observing [20:03:11] I think it was cogent [20:03:38] XioNoX: what's the problem that you're seeing specifically? [20:03:55] no more packet loss in the last minute or so [20:04:32] paravoid: esams, doesn't see the 10/8 routes nor 198.73.209.0/24 [20:05:01] forget 198.73.209.0/24 for now [20:05:07] which esams router, and what prefix specifically? 
[20:05:11] and over what protocol [20:05:37] so yeah, my problem seems resolved. thanks for the help. (current traceroute is: https://pastebin.com/qncVQ6Ee) [20:06:16] taking cr2-esams as example, ebgp from cr2-eqiad, Im using 10.64.32.0/22 as test prefix [20:06:44] that prefix should be over OSPF, not eBGP, no? [20:06:49] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3673630 (10Dzahn) @Tobi_WMDE_SW Ok, got it. Thanks for explaining and confirming that. I could also confirm now that Pablo has signed L2. [20:06:51] RECOVERY - HP RAID on db2038 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [20:07:11] akosiaris@re0.cr2-esams> show route 10.2.2.15 returns Discard [20:07:21] LVS are a different matter [20:07:24] these are over eBGP [20:07:49] still, it should not return discard, right ? [20:07:54] it shouldn't, no [20:08:03] or we don't readvertise the to other DCs ? [20:08:09] s/the/them/ [20:08:21] we do AFAIK [20:09:10] paravoid: part of the plan for T167840 is to use ebgp instead of stretching ospf across the atlantic [20:09:10] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [20:09:25] I think I found the filter that needs to be updated [20:09:39] what is the problem with stretching ospf across the atlantic? [20:10:04] ebgp is wayy slower [20:10:18] (this can wait) [20:10:30] oh, I thought this was a separate task [20:10:55] https://phabricator.wikimedia.org/T167841#3352281 [20:10:57] well the route's still there via OSPF, just not preferred to the aggregate->discard [20:11:07] bblack: which route? 
[20:11:17] akosiaris@re0.cr2-esams> show route 10.64.32.0 [20:11:23] ^ that one [20:11:35] that's results into just the 0/0 route [20:11:37] oh nevermind, reading fail on my part [20:11:39] it's still 0/0 [20:11:45] yes, I didn't look for the 0/0 :) [20:11:50] :) [20:12:29] export BGP_Wikimedia_own_space; [20:12:39] mark, paravoid, we can either switch back to using ospf, or update the BGP filters to esams to allow 10/0 routes to be propagated [20:13:20] that wouldn't allow neither "customer" routes, nor 10/8 routes [20:13:37] oh nm, customer routes are there now [20:13:46] but 10/8 wouldn't be [20:13:54] I'd say switch back to OSPF for now [20:14:09] just to make each migration separate [20:16:09] so either way, the issue we're discussing now should just affect the depooled esams. The issue earlier with packet loss from EU->eqiad: "it was cogent"? [20:16:19] I think it was cogent, yes [20:16:25] meaning they just happened to screw something up during our work, or? [20:16:43] or our work saturated one of their links [20:16:44] or they temporarily didn't handle the big shift in our traffic very well? 
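The "show route" results discussed above follow from longest-prefix matching: a lookup returns the most specific route covering the address, so with no 10.64.32.0/22 route installed the lookup falls through to 0/0. Below is a minimal sketch of that rule only. The table contents are assumptions for illustration (in particular the 10.2.2.0/24 discard aggregate is hypothetical, the log does not say which aggregate matched 10.2.2.15), and the sketch ignores protocol preference between routes of equal length.

```python
import ipaddress


def lookup(table, address):
    """Return (prefix, next_hop) for the longest matching prefix, or None if nothing matches."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    net, hop = max(matches, key=lambda item: item[0].prefixlen)
    return str(net), hop


# Hypothetical table while the specific routes are not being learned.
before = {
    ipaddress.ip_network("0.0.0.0/0"): "default",
    ipaddress.ip_network("10.2.2.0/24"): "discard",  # assumed aggregate
}

# Same table once a more specific route is learned again (for example via OSPF).
after = dict(before)
after[ipaddress.ip_network("10.64.32.0/22")] = "ospf"

if __name__ == "__main__":
    print(lookup(before, "10.64.32.0"))  # falls through to the default route
    print(lookup(before, "10.2.2.15"))   # caught by the discard aggregate
    print(lookup(after, "10.64.32.0"))   # the specific route now wins
```

This is why checking for the test prefix on cr2-esams only makes sense if the 0/0 match is ruled out as well, as noted in the exchange above.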
[20:16:48] yeah :) [20:16:55] ok [20:17:12] traceroute shows 0% now where it showed packet loss before [20:17:20] so it wasn't ICMP rate limiting before, but a real issue [20:19:41] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [20:19:42] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [20:19:42] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [20:19:42] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [20:19:43] RECOVERY - Host cp3038.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 84.95 ms [20:19:43] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [20:19:43] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [20:19:43] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [20:19:43] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [20:19:46] ospf re-enabled on one link [20:19:48] yeah [20:19:49] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16778 bytes in 0.502 second response time [20:19:51] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 84.46 ms [20:19:51] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [20:20:01] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [20:20:01] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 114 ESP OK [20:20:01] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [20:20:02] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [20:20:02] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [20:20:02] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 26 ESP OK [20:20:02] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [20:20:02] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [20:20:02] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [20:20:03] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [20:20:03] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK 
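The traceroute reasoning above (loss at one hop may be ICMP rate limiting, loss that carries through to the destination is real) can be written as a small heuristic. This is a sketch under assumptions: the function name, the 1% threshold, and the per-hop loss figures are invented for illustration and are not taken from the pastebin traceroutes.

```python
def first_real_loss_hop(loss_by_hop, threshold=1.0):
    """Return the 1-based hop from which loss persists through to the final hop, or None.

    loss_by_hop is a list of packet-loss percentages, one per hop, in path order.
    Loss that shows up at a hop but disappears further along the path is treated
    as that router rate-limiting its ICMP replies rather than dropping traffic.
    """
    first = None
    for hop, loss in enumerate(loss_by_hop, start=1):
        if loss > threshold:
            if first is None:
                first = hop
        else:
            # A clean later hop means the earlier loss did not affect forwarded traffic.
            first = None
    return first


if __name__ == "__main__":
    rate_limited = [0, 0, 0, 40, 0, 0]       # hypothetical: only hop 4 drops replies
    real_loss = [0] * 9 + [12, 15, 11]       # hypothetical: loss from hop 10 onwards
    print(first_real_loss_hop(rate_limited))
    print(first_real_loss_hop(real_loss))
```

A single traceroute run is still weak evidence either way; comparing runs before and after, as done above, is what settled it here.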
[20:20:04] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 26 ESP OK [20:20:21] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [20:20:21] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 26 ESP OK [20:20:22] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [20:20:25] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [20:20:25] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [20:20:25] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [20:20:25] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [20:20:31] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [20:20:31] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [20:20:31] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 26 ESP OK [20:20:32] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [20:20:32] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [20:20:32] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [20:20:32] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [20:20:32] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [20:20:33] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [20:20:33] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [20:20:41] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [20:20:47] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16776 bytes in 0.504 second response time [20:20:49] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [20:20:49] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [20:20:49] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [20:20:49] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [20:20:49] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [20:20:49] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [20:20:49] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK 
[20:20:51] RECOVERY - Host cp3044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.68 ms [20:20:51] RECOVERY - Host cp3043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.05 ms [20:21:02] RECOVERY - Host cp3031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.99 ms [20:21:02] RECOVERY - Host cp3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.53 ms [20:21:02] RECOVERY - Host cp3005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.77 ms [20:21:02] RECOVERY - Host cp3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.47 ms [20:21:02] RECOVERY - Host cp3008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.07 ms [20:21:02] RECOVERY - Host cp3030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.07 ms [20:21:02] RECOVERY - Host cp3040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.02 ms [20:21:03] RECOVERY - Host cp3037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.94 ms [20:21:03] RECOVERY - Host cp3006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.18 ms [20:21:04] RECOVERY - Host cp3042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.68 ms [20:21:04] RECOVERY - Host cp3048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.55 ms [20:21:05] RECOVERY - Host cp3010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.40 ms [20:22:02] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177881 [20:22:03] RECOVERY - Host re0.cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.50 ms [20:22:03] RECOVERY - Host re0.cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.73 ms [20:22:03] RECOVERY - Host maerlant.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.62 ms [20:22:03] RECOVERY - Host cp3032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.91 ms [20:22:03] RECOVERY - Host lvs3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.78 ms [20:22:03] RECOVERY - Host lvs3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.67 ms 
[20:22:04] RECOVERY - Host cp3036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [20:22:06] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T177881#3673693 (10ops-monitoring-bot) [20:23:42] RECOVERY - Host ms-be3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.74 ms [20:23:42] RECOVERY - Host ms-be3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.89 ms [20:23:42] RECOVERY - Host ms-fe3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.49 ms [20:23:42] RECOVERY - Host ms-be3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.88 ms [20:23:42] RECOVERY - Host ms-be3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.92 ms [20:23:42] RECOVERY - Host ms-fe3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [20:23:42] RECOVERY - Host multatuli.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.98 ms [20:24:02] RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.44 ms [20:24:21] RECOVERY - Host nescio.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.56 ms [20:24:21] RECOVERY - Host cp3033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.68 ms [20:24:21] RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.46 ms [20:24:22] RECOVERY - Host cp3041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.49 ms [20:24:22] RECOVERY - Host cp3049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.64 ms [20:24:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:24:33] RECOVERY - Host bast3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.54 ms [20:26:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:26:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [1000.0] [20:32:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above 
the threshold [250.0] [20:33:32] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:39:07] (03PS1) 10Hoo man: Increase the shard count for Wikidata entity dumps from 5 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/383414 (https://phabricator.wikimedia.org/T177486) [20:46:51] (03CR) 10Herron: [C: 032] Add puppetcompiler1001 forward and reverse DNS records [dns] - 10https://gerrit.wikimedia.org/r/383400 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [20:49:14] I'm pretty much done with esams [20:49:20] all the tests are successful so far [20:53:40] bblack, paravoid, anything else you want to test before repooling esams? Or let it sit a bit more? [20:55:01] did you renumber ams-ix as well? [20:55:05] or is that later? [20:56:13] I did it [20:56:40] we accept BGP on both old and new IP as well as old and new AS# the time everybody transitions to the new AS# [20:56:59] oh, cool [20:57:35] XioNoX: is there any good reason to doubt the stability of the new esams? [20:58:38] (03PS1) 10Hoo man: Move WB client "disabledUsageAspects" setting into $wmgWikibaseDisabledUsageAspects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383439 (https://phabricator.wikimedia.org/T151717) [20:58:40] (03PS1) 10Hoo man: Enable Statement usage tracking on cawiki and cewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383440 (https://phabricator.wikimedia.org/T151717) [20:59:16] bblack: no [21:01:19] so, we're still on the downslope of esams traffic for the day [21:02:09] it'll be getting near the daily minimum in a few hours, and begins slowly ramping back up signficantly by ~7h from now [21:02:59] I think may as well turn it on now [21:03:46] (but I could probably invent a plausible-sounding theory for why it's best to do it: now, 3h from now, 7h from now, etc...) [21:05:14] haha [21:05:16] the upside of now is we have more hours of normal daytime for USians to react to whatever might happen. 
if we do it later it will fall in the relatively-dead-zone of ops coverage [21:05:25] good point [21:05:46] paravoid: can you delete https://www.peeringdb.com/net/14249 ? and update the address on https://www.peeringdb.com/org/1489 ? [21:07:08] XioNoX: the former I can't do, there is no delete button [21:07:21] I think they automatically imported it from the RIPE db [21:07:39] ah okay, not like it's urgent anyway [21:08:03] huh [21:08:06] look at that, I did it :P [21:08:12] did both [21:08:21] (and not sure, if you saw, fixed our AMS-IX ASN/IPv6 earlier) [21:09:54] yeah, thanks1 [21:10:47] paravoid: in the "private peering facility" list, we need to update the AS# as well [21:11:16] oh yeah [21:11:19] the form is pretty stupid tho [21:11:38] and check the RS peer box where needed [21:11:51] huh [21:12:10] I can't change the ASN of a facility, nor include one when adding a facility [21:12:20] but when I add (again) evoswitch, it shows up as 14907 automatically [21:12:31] oh [21:12:35] (03PS1) 10BBlack: Revert "Depool esams for expected blips during ASN renumbering" [dns] - 10https://gerrit.wikimedia.org/r/383462 [21:12:39] (03PS2) 10BBlack: Revert "Depool esams for expected blips during ASN renumbering" [dns] - 10https://gerrit.wikimedia.org/r/383462 [21:13:16] ok, should be fixed I think [21:13:36] (03PS1) 10Smalyshev: Add configuration for statement indexing for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383464 (https://phabricator.wikimedia.org/T175199) [21:13:36] great! [21:13:54] XioNoX: ready for traffic? 
[21:14:00] bblack: yup [21:14:12] (03CR) 10BBlack: [C: 032] Revert "Depool esams for expected blips during ASN renumbering" [dns] - 10https://gerrit.wikimedia.org/r/383462 (owner: 10BBlack) [21:14:59] !log esams repooling - T167840 [21:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:06] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [21:15:41] XioNoX: you should have access to peeringdb too btw [21:15:45] (but happy to be doing those changes) [21:15:48] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for statement indexing for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383464 (https://phabricator.wikimedia.org/T175199) (owner: 10Smalyshev) [21:16:28] paravoid: I see that I'm "affiliated" but don't see any edit buttons [21:16:56] see pm [21:17:05] ah okay, thx! [21:17:43] (03PS2) 10Smalyshev: Add configuration for statement indexing for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383464 (https://phabricator.wikimedia.org/T175199) [21:20:14] added SG3 to peeringdb too [21:20:45] not the IXP though, that requires IPv4/IPv6 [21:27:26] (03PS1) 10MaxSem: Switch test wikis to HTML5 fragment mode in links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383473 (https://phabricator.wikimedia.org/T152540) [21:33:15] heh found our IXP addresses [21:33:18] so added that as well [21:35:47] for equinix@sg? [21:35:50] yes [21:35:52] neat [21:36:04] did they merge SG3 into our main equinix portal account btw? 
[21:36:08] https://as14907.peeringdb.com/ [21:36:15] it's there on the IX portal [21:36:22] which is a bit separate [21:36:56] doesn't seem to be the case for the main portal yet [21:37:22] ok, they asked earlier today and I said yes [21:53:15] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3673991 (10Dzahn) I have contacted WMF legal to reach out to Pablo so he can sign the right NDA. I asked and L2 is only for Phabricator access to... [21:57:02] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3673997 (10Addshore) >>! In T177599#3673991, @Dzahn wrote: > I have contacted WMF legal to reach out to Pablo so he can sign the right NDA. I ask... [21:58:52] 10Operations, 10netops, 10Patch-For-Review: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3674000 (10ayounsi) Temporary AVOID-PATHS removed on cr2-eqiad. The maintenance is now completed, some notes: - It was not clear that the plan included removing OSPF on the trans-atlantic link... [21:59:02] 10Operations, 10netops, 10Patch-For-Review: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3674001 (10ayounsi) a:03ayounsi [22:08:22] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3674019 (10Dzahn) >>! In T177599#3673997, @Addshore wrote: > Is L2 enough to get into the ldap/nda group? Afaict, no, it's not. I was told L2 is... [22:13:01] 10Operations, 10Ops-Access-Requests, 10Analytics: analytics-privatedata-users access for Jeff Green - https://phabricator.wikimedia.org/T177602#3664354 (10Dzahn) @Jgreen Do you just need Hive/Hadoop or do you additionally need sampled webrequest logs and stat boxes with private data? 
Asking this way because... [22:14:47] 10Operations, 10Ops-Access-Requests, 10Analytics: analytics-privatedata-users access for Jeff Green - https://phabricator.wikimedia.org/T177602#3674030 (10Dzahn) Yea, aware Jeff has root on the mentioned stat boxes anyways, heh. [22:16:27] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3674032 (10Addshore) > any LDAP group membership means now being listed in the "admins" module in puppet and signing a different NDA with Legal.... [22:17:18] (03PS2) 10Aklapper: Phab: Allow aklapper to delete panels on dashboards [puppet] - 10https://gerrit.wikimedia.org/r/380959 [22:20:33] (03CR) 10Dzahn: [C: 031] Phab: Allow aklapper to delete panels on dashboards [puppet] - 10https://gerrit.wikimedia.org/r/380959 (owner: 10Aklapper) [22:21:00] (03CR) 10Dzahn: [C: 031] "maybe add that to the description too" [puppet] - 10https://gerrit.wikimedia.org/r/380959 (owner: 10Aklapper) [22:28:03] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3674044 (10Addshore) So. I think we might be okay to close this ticket now? 
[22:34:57] (03CR) 10BBlack: dnsrecursor: drop ganglia metrics support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382929 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:35:04] (03CR) 10BBlack: [C: 031] dnsrecursor: drop ganglia metrics support [puppet] - 10https://gerrit.wikimedia.org/r/382929 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:35:40] (03CR) 10BBlack: [C: 031] authdns: remove ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/382918 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:36:49] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 3 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3674078 (10Jdforrester-WMF) 05Open>03Resolved Provisionally marking as Resolved. [22:36:57] No train today? [22:37:04] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-Addshore: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#3664202 (10RStallman-legalteam) For the past year or so, we have had WMDE staff requiring LDAP access sign an NDA with legal. @Pablo-WMDE: I ca... [22:39:39] eddiegp: sorry, Chad was going to do it but it appears he was delayed in getting back from his vacation :/ It was too late to start it when we realized this (it's mostly a one-person job, so we are used to just seeing it happen as all the others continue to work on their other work) [22:39:50] we'll catch up tomorrow [22:40:28] Okay, was just wondering. :) [22:57:41] (03PS3) 10EddieGP: wikitech: Align 'contentadmin' and 'sysop' permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171010T2300). [23:00:05] eddiegp and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:29] o/ [23:03:21] I can SWAT [23:04:04] i am here [23:04:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) (owner: 10EddieGP) [23:07:05] (03Merged) 10jenkins-bot: wikitech: Align 'contentadmin' and 'sysop' permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) (owner: 10EddieGP) [23:07:15] (03CR) 10jenkins-bot: wikitech: Align 'contentadmin' and 'sysop' permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) (owner: 10EddieGP) [23:08:07] (03PS3) 10Dzahn: annualreport: rm module, merge into profile, fix style [puppet] - 10https://gerrit.wikimedia.org/r/382351 [23:08:41] PROBLEM - Check whether ferm is active by checking the default input chain on ftp-internal is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:09:19] eddiegp: hrm, I'm unsure: can this be tested on mwdebug (is labstestwiki only on silver, too)? If so, it's on mwdebug1002, if not, let me know and I can deploy. 
[23:09:33] XioNoX: you must be doing the PDU upgrade:) [23:09:48] Nothing to test, it's wikitech (silver) only :) [23:10:26] mutante: indeed, I didn't know there was a icinga check :) [23:10:40] ok, going live [23:12:40] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:382024|wikitech: Align "contentadmin" and "sysop" permissions]] T171208 (duration: 00m 48s) [23:12:46] ^ eddiegp live now [23:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:47] T171208: contentadmin has suddenly less permissions - https://phabricator.wikimedia.org/T171208 [23:13:44] (03PS2) 10Thcipriani: Enable Vector print logo on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) (owner: 10Jdlrobson) [23:13:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) (owner: 10Jdlrobson) [23:13:52] thcipriani: Just checked it, works :) Thanks! [23:14:09] eddiegp: great, glad all works as expected, thanks for checking :) [23:14:40] (03CR) 10Dzahn: "compiles now and still has at least a delta of 1 violation http://puppet-compiler.wmflabs.org/8270/bromine.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/382351 (owner: 10Dzahn) [23:15:05] (03PS4) 10Dzahn: annualreport: rm module, merge into profile, fix style [puppet] - 10https://gerrit.wikimedia.org/r/382351 [23:17:03] (03PS5) 10Dzahn: annualreport: rm module, merge into profile, fix style [puppet] - 10https://gerrit.wikimedia.org/r/382351 [23:17:14] (03CR) 10Aklapper: "I didn't want to link to that example in the commit message as I don't want to potentially encourage folks to repeat that approach. 
It's n" [puppet] - 10https://gerrit.wikimedia.org/r/380959 (owner: 10Aklapper) [23:17:42] (03Merged) 10jenkins-bot: Enable Vector print logo on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) (owner: 10Jdlrobson) [23:17:50] (03CR) 10jenkins-bot: Enable Vector print logo on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383390 (https://phabricator.wikimedia.org/T177800) (owner: 10Jdlrobson) [23:17:56] (03CR) 10Dzahn: [C: 032] annualreport: rm module, merge into profile, fix style [puppet] - 10https://gerrit.wikimedia.org/r/382351 (owner: 10Dzahn) [23:18:34] jdlrobson: vector print logo on testwiki is live on mwdebug1002, check please [23:18:42] thcipriani: testing [23:19:21] thcipriani: sync away! [23:19:28] * thcipriani does [23:19:45] (03CR) 10Dzahn: "it's a complete no-op on bromine. the compiler changes are merely the source path to the templates changed" [puppet] - 10https://gerrit.wikimedia.org/r/382351 (owner: 10Dzahn) [23:21:24] (03PS2) 10Thcipriani: Disable OCG services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [23:21:29] !log thcipriani@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [23:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:56] huh well that's interesting [23:23:00] > No such file or directory in /srv/mediawiki/php-1.31.0-wmf.2/includes/libs/CSSMin.php [23:24:52] thcipriani: yeh that's what im trying to get to the bottom of [23:25:26] well this certainly seems to spike the logs a bit on that [23:25:30] https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 [23:25:40] dunno if you have any time to help me debug that by telling me what the value of $wgVectorPrintLogo is for that 
wiki? [23:25:53] thcipriani: even if just enabled on testwiki? [23:27:32] thcipriani: had similar issues with hashar yesterday [23:27:50] https://gist.github.com/thcipriani/22af47007be69f7b74cb4ec2ad7a8f59 [23:28:06] for testwiki [23:28:25] im guessing it needs an absolute url... [23:30:13] ok, so rollback for now? [23:30:49] thcipriani: i guess.. although none the wiser :) [23:31:59] (03PS1) 10Thcipriani: Revert "Enable Vector print logo on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383484 [23:32:10] (03PS1) 10Nuria: PageCreate events are no longer flowing [puppet] - 10https://gerrit.wikimedia.org/r/383485 (https://phabricator.wikimedia.org/T171629) [23:32:25] (03CR) 10jerkins-bot: [V: 04-1] PageCreate events are no longer flowing [puppet] - 10https://gerrit.wikimedia.org/r/383485 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [23:32:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383484 (owner: 10Thcipriani) [23:32:41] (03PS3) 10Nuria: Removing from whitelist tables that no longer exist [puppet] - 10https://gerrit.wikimedia.org/r/383185 (https://phabricator.wikimedia.org/T171629) [23:34:00] (03Merged) 10jenkins-bot: Revert "Enable Vector print logo on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383484 (owner: 10Thcipriani) [23:35:21] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: revert [[gerrit:383390|Enable Vector print logo on test wiki]] (duration: 00m 47s) [23:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:07] (03CR) 10jenkins-bot: Revert "Enable Vector print logo on test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383484 (owner: 10Thcipriani) [23:37:43] jdlrobson: https://gerrit.wikimedia.org/r/#/c/383391/ is live on mwdebug if there's anything to test there [23:38:15] mwdebug1002, that is [23:38:18] thcipriani: nothing to test... 
yet :) [23:38:27] *ominous tones* [23:38:29] thcipriani: the next change will test it - this just adds better error handling [23:38:38] k, going live :) [23:38:55] (03PS1) 10Herron: Add puppetcompiler1001 to dhcp and autoinstall configs [puppet] - 10https://gerrit.wikimedia.org/r/383488 (https://phabricator.wikimedia.org/T177843) [23:39:14] !log upgrading ps1-a2-eqiad - T175341 [23:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:21] T175341: Review and fix PDU settings for syslog/ntp/email servers - https://phabricator.wikimedia.org/T175341 [23:41:21] (03PS2) 10Herron: Add puppetcompiler1001 to dhcp and autoinstall configs [puppet] - 10https://gerrit.wikimedia.org/r/383488 (https://phabricator.wikimedia.org/T177843) [23:41:26] !log thcipriani@tin Synchronized php-1.31.0-wmf.2/extensions/Collection/RenderingAPI.php: SWAT: [[gerrit:383391|Do not request render if renderer not configured]] T177795 (duration: 00m 47s) [23:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:32] T177795: Remove references to OCG service from prod config - https://phabricator.wikimedia.org/T177795 [23:41:48] (03PS1) 10Jdlrobson: Enable Vector print logo and print styles on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383491 (https://phabricator.wikimedia.org/T169732) [23:42:05] (03PS3) 10Thcipriani: Disable OCG services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [23:42:38] (03CR) 10Herron: [C: 032] Add puppetcompiler1001 to dhcp and autoinstall configs [puppet] - 10https://gerrit.wikimedia.org/r/383488 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [23:43:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [23:45:50] (03Merged) 10jenkins-bot: Disable OCG services [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [23:46:07] (03CR) 10jenkins-bot: Disable OCG services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383210 (https://phabricator.wikimedia.org/T177795) (owner: 10Jdlrobson) [23:46:45] jdlrobson: ^ is live on mwdebug1002, check please [23:47:01] thcipriani: on it [23:49:06] w00t sync away thcipriani [23:49:16] * thcipriani does [23:52:19] 10Operations, 10DC-Ops: Review and fix PDU settings for syslog/ntp/email servers - https://phabricator.wikimedia.org/T175341#3674278 (10ayounsi) [23:52:35] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:383210|Disable OCG services]] PART I T177795 (duration: 00m 47s) [23:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:41] T177795: Remove references to OCG service from prod config - https://phabricator.wikimedia.org/T177795 [23:53:54] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:383210|Disable OCG services]] PART II T177795 (duration: 00m 48s) [23:53:59] ^ jdlrobson live [23:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:19] thcipriani: w00t [23:54:38] kudos :)
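The scap failure at 23:21:29 ("average error rate on 3/11 canaries increased by 10x") is a canary check: compare each canary's error rate before and after the sync and stop if it jumps. The following is a rough sketch of that idea only. It is not scap's implementation; the function name, the baseline floor, and the host names and rates are all hypothetical.

```python
def flagged_canaries(before, after, factor=10.0, floor=1.0):
    """Return the sorted canary hosts whose error rate rose by at least `factor`.

    before/after map host name to errors per minute. `floor` is an assumed
    minimum baseline so that a host with zero errors beforehand is not flagged
    by a single new error.
    """
    flagged = []
    for host, base in before.items():
        baseline = max(base, floor)
        if after.get(host, 0.0) >= factor * baseline:
            flagged.append(host)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical measurements for three canaries.
    before = {"canary-a": 2.0, "canary-b": 0.0, "canary-c": 5.0}
    after = {"canary-a": 25.0, "canary-b": 3.0, "canary-c": 6.0}
    bad = flagged_canaries(before, after)
    print(f"{len(bad)}/{len(before)} canaries over threshold: {bad}")
```

Whether a deploy should abort on any flagged canary or only above some fraction is a policy choice this sketch leaves to the caller.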